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ABSTRACT 


This is the second scientific report of a study of the development 
techniques for the realization of ultrareliable, high-performance, space- 
borne computers. The techniques developed are in support of computer 
structures in which reliability is achieved through autonomously con- 
trolled logical reconfiguration and fault masking, A multiprocessor 
model is described that is particularly appropriate to the attainment 
of ultrareliability. Local design techniques, which facilitate recon- 
figuration, are discussed for various computer functions, including 
memory and microprogram control. Design techniques are presented for 
economical, fault- tolerant , data commutation networks. An initial 
effort is being directed toward a formal description of program-design 
techniques that will facilitate hardware diagnosis and, hopefully, yield 
mistake-free programs. 
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FOREWORD 


This is an interim report, summarizing work accomplished during the 
first six months of the second phase of a two year program, the goal of 
which is the development of techniques for the realization of ultra- 
reliable space computers. This study has been conducted in the Computer 
Techniques Laboratory of Stanford Research Institute, under the sponsor- 
ship of the Electronics Research Center of the National Aeronautics and 
Space Administration. 

The goals of the first phase were to survey the state of the art of 
design for achieving ultrareliable spaceborne computers, and to form a 
basis for research which would advance that art. The final report, which 
resulted from the first phase of the program, was concerned with the 
following: 

(1) The basic characteristics of an advanced spaceborne 
computer 

(2) A description of fault-masking techniques for general 
logic functions 

(3) A survey of codes for storage and arithmetic operations 

(4) Problems of system organization for dynamic error control 

(5) Tests for diagnosis of fault conditions 

(6) Some initial descriptions of network designs for a recon- 
figurable computer, including commutation or interconnec- 
tion networks, programmable processing modules and pro- 
grammable control units 

(7) Error-control techniques for memory systems 

(8) Distributed power supply systems 

(9) The application of magnetic logic 

(10) A survey of the published literature on the attainment of 
reliable systems through the use of redundancy. 
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The goal of the second phase is to develop detailed techniques for 
the logical design of an advanced, ultrareliable spaceborne computer. 

The techniques to be developed are to be used in support of computer 
structures in which reliability is achieved through autonomously- 
controlled logical reconfiguration and fault masking. In particular, 
those techniques have been developed by following a certain method of 
approach, which entails certain steps. The first step in the approach 
involves the development of a system organization that facilitates 
dynamic maintenance processes. On the basis of the selected system 
organization, a detailed logical design is then performed of networks 
that are uniquely appropriate for a reconf igurable computer. Thirdly, 
diagnostic procedures, reliability enhancement techniques and reliability 
analysis measures are developed for these networks, where the require- 
ment exists. The next step in the approach requires that software tech- 
niques be developed to aid in the diagnosis and detection of failures. 
Also, techniques must be developed for designing reliable programs. 
Finally, reliability analysis techniques are developed for the overall 
system. 

The report is organized into six chapters and one Appendix , The 
first chapter which serves as an overall introduction to the report, 
contains the statement of the problem; the goals, methods, and assumptions 
of the study; and a brief review of prior work on pertinent aspects of 
reliability enhancement; in addition to the organization of the report. 

Chapter II is concerned with the principles of multiprocessor sys- 
tem design that are particularly appropriate to the attainment of ultra- 
reliability. Chapter III contains logical design techniques for networks 
identified with the memory, control, and microprogram control functions. 
Chapter IV contains design details for commutation networks that are to 
perform the important function of data switching in a multiprocessor 
computer. Chapter V contains a formal description of program- design 
techniques, which will facilitate hardware diagnosis and which will, 
hopefully, yield mistake-free programs. Chapter VI contains a brief 
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description of other topics considered in the study, namely network 
diagnosis, and a survey of the pertinent literature, and also the con- 
clusions and a summary of our plans for the remainder of the program. 

The report is self-contained at least as far as the statement of 
principles is concerned. Detailed mathematical proofs and the descrip- 
tion of some hardware designs, particularly concerning commutation net- 
works and arithmetic processing elements have been deferred until publi- 
cation of the iinal report. The reader is referred to the first phase 
final report for detailed background information. 

The technical studies reported herein are the work of the following 
members of the Computer Techniques Laboratory: 

Mr. J. Goldberg 
Mr. M. W. Green 

Professor E. L. Lawler (University of Michigan-Summer 
employee of SRI) 

Dr. K. N. Levitt 

Dr. R. A. Short 

Dr. H. S. Stone 

Dr. J. B. Turner 

All of the individuals contributed to the writing of the various sections 
of the report. Mr. Goldberg who serves as Project Supervisor and 
Dr. K. N. Levitt who serves as Project Leader are responsible for the 
organization and editing of the report. 
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I INTRODUCTION 

In this chapter we discuss briefly and in general terms, the problem 
of realizing ultrareliabie spaceborne computers. Specifically, the chap- 
ter contains a discussion of the problem of designing computers that are 
appropriate to the characteristics of space missions, the statement of 
the program goals, a brief review of prior work on the attainment of ultra- 
reliability, and a brief summary of this report. 

A. Statement of the Problem 

In this section we consider the basic characteristics of space mis- 
sion computation, which impose severe constraints on the spaceborne 
computer design; and the conclusions, concerning the design, that are a 
consequence of these constraints.* The advanced spaceborne computer must 
respond to the following. 

1. Multiple- Problem Sets 

Several problems must be accommodated simultaneously, implying that 
a multiprocessing and/or a multiprogramming capability is required. 

2. Highly- Varied and Complex Computations 

It is anticipated that the scope of the mission will require a 
general scientific- type computer that will accommodate to a wide variety 
of sensors and output devices. 

3. Computations with a Range of Priorities 

It is convenient to assign to each mission computation three priority 
measures. The first of these is a critical value, which indicates the 
relative need for the existence of a particular computation, compared with 
the other computations at the moment. For example, certain launch 




References are listed at the end of the report. 
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computations are probably essential compared with certain ion-density 
computations. The second such measure is an accurac y value, reflecting 
the value attached to varying degrees of accuracy in the particular 
computation. For example, although a 2000-mile pass may be desired, a 
20,000-mile pass may still be tolerable if the closer pass simply cannot 
be attained. The third measure is an urgency value, which reflects the 
required speed with which a certain computation must be performed and 
hence, the amount of equipment that must be devoted to its execution. 
These characteristics imply one attractive approach, A computer may be 
designed which is capable of altering the logical interconnections among 
the computer components and, the tasks may be scheduled to match the 
available performance capability. Such an organization is, colloquially, 
said to embody reconfiguration with graceful degradation . 

4. Variable Capacity of the Earth- Vehicle 
Communication Link 

During certain phases of the mission, communication to the vehicle 
will not be possible, although there will be many instances when high 
speed communication — of a rate possibly exceeding 1 megabit/sec — will be 
quite feasible. Hence, the computer must be capable of autonomously 
carrying out diagnosis and repair routines during part of the mission, 
and conversely must be capable of responding to (and indeed taking ad- 
vantage of) external control information during other times. 

5. Failures Which May Be Transient or Permanent 

This fact is, of course, obvious, but the optimum technique, which 
will distinguish between these two fault types, is not at all obvious. 

One approach is to treat all failures as transient and effect a try-again 
procedure in response to all failures. 

6. Failures Which May Not Be Independent or May Not 
Embrace Single Components 

It is clear that the potentially high-stress space environment can 
result in a failure that will not be confined to a single component. 

For example, a radiation pulse could affect a sizable portion of the 
system, and a sudden unexpected acceleration could result in a fractured 
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chip. It has been convenient in the past to assume that failures (possibly 
embracing many elements as described above) occur one at a time. This 
assumption might be tenable if each failed subsystem is repaired immediately 
following the fault occurrence. One possible counterexample is the case 
wherein a system relies upon the switching of standby units to achieve ultra- 
reliability. It is required that such standby units either possess fault 
detection capabilities, or that they are diagnosed immediately prior to 
insertion in the system. 

7. Severe Constraints on Weight and Power 

Although the constraints on the weight and power of spaceborne equip- 
ment have been relaxed in recent years, it remains imperative to achieve 
a design which provides the maximum ratio of computing power per total 
equipment. This observation provides additional evidence for a 
multiprocessing-graceful degradation system, with a minimum amount of 
pure standby redundancy. 

8. Failures in Most Components 

Statistical reliability measures, comprehending various risk policies, 
have been shown to provide useful estimates of system performance. How- 
ever, since the variance of most semiconductor failure distributions is 
quite large, it is important to minimize the number of blocks for which 
a single component failure would disable the entire system. Essentially, 
it is important to incorporate redundancy into as much of the system as 

possible, even though, on a statistical basis, not much improvement is 

* 

realized by the Inclusion of such redundancy. 

B, Statement of Program Goals 

In recognition of the severe problem of designing ultrareliable 
spaceborne computers, NASA Electronics Research Center has set up the 
following study goals, the first three of which relate to the completed 


It is apparent that the design philosophy should reflect Murphy's Law. 
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first phase, and the latter four of which relate to the half-completed 
second phase: 

(1) To survey the state of the art of logical design of space- 
borne computers as it pertains to the enhancement of reli- 
ability 

(2) To conceive and evaluate new schemes of system design and 
operation that offer promise of advancing the state of 
the art 

(3) To recommend further directions of research that will aid 
in the improvement of present techniques 

(4) To investigate the organization of aerospace computers 
having high computational performance in which autonomously 
controlled logical reconfiguration of equipment and fault 
masking are employed for the purpose of achieving ultra- 
reliable operation. Derive specifications for classes of 
networks and processes that are appropriate to the realiza- 
tion of such organizations 

(5) To investigate the design of networks that realize the func- 
tions of general logic and commutation for reconfiguration 
for the systems derived in Item 4. These networks will 
incorporate various features of error control including 
fault masking, diagnosability , and reconfigurability. The 
design will employ criteria appropriate to advanced 
integrated-semiconductor array technology. Develop logical 
designs and techniques of analysis and synthesis appropriate 
to the particular networks designed. Develop criteria for 
the evaluation of the reliability of such networks. 

(6) To investigate the design of programs that facilitate the 
reconfiguration processes in systems of the type developed 
in Item 4 and that facilitate the flexible variation of 
computational performance with amount of available equip- 
ment. Propose the specification of requirements on 
executive programs that consider reconfiguration. Develop 
schemes for re-addressing and replacement of hardware con- 
trol by software subroutines. 

(7) To investigate techniques for the evaluation of the recon- 
figurable computers proposed in Item 4. Develop approaches 
to the construction of theoretical reliability models for 
such computers. 

C. Brief Review of Prior Work 

A significant effort has been devoted in the past decade toward 
solving various facets of the problem of realizing ultrareliable digital 
systems. This work, summarized in detail in the Phase I — Final Report, 1 
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ranges from investigations of fault-masking techniques for various com- 
puter sub-blocks, such as arithmetic processor and memory units, to studies 
of reliability-enhancement policies for large systems. It is felt that 
most of the problems pertaining to enhancing the reliability of isolated 
digital sub-blocks are now understood, at least from the standpoint of 
making sound engineering judgments concerning the use of the various 


techniques , 


Strictly passive redundancy techniques have been applied to 


the control and arithmetic processing sections of the Saturn IVb guidance 
computer, and it has been concluded 2 that the application of such tech- 
niques, exclusively, cannot economically satisfy the computation and 
reliability requirements of future spaceborne computers. 

The anticipation of this conclusion prompted an increased effort on 
the part of many organizations in the investigation of dynamic error- 
control mechanisms, in which the logical interconnections among the 
components of the computer may be altered. Theoretically this approach 
enables more efficient utilization of redundancy than the passive approach. 
In an attempt to substantiate this supposition, Avizienis , 3 ’ 4 et al. have 
been working on the construction of the JPL-STAR reconfigurable guidance 
computer, which embodies probably the simplest form of reconfiguration. 

This computer consists of a set of identical processing units, each with 
self-contained error-detection control by means of an arithmetic code, 
and a reliable magnetic power stepping switch, f which can connect power 
to any one of the units. IBM 5 is presently building a reconfigurable 
version of the Saturn IVb guidance computer. In these systems the recon- 
figuration is employed only at very high functional levels but it is 
well-known that there is potentially greater gain to be achieved by 
employing the reconfiguration at lower system levels. 


Three of the most promising so-called passive techniques discussed in 
Ref. 1 are the use of replicated- voting logic, error- correcting codes, 
and adaptive voting logic. 

^ In a forthcoming SRI report the operation of this power switch will be 
described. 
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Graceful-Degradation Systems which enable more efficient utilization 
of available equipment have been conceived and supposedly evaluated. 

Among the many references on this approach, in Ref, 6 a single processor 
structure is assumed, and in Refs. 7 and 8 a multiprocessor structure is 
postulated. In these studies as well as many others, several important 
items are not treated in depth, namely those relating to the following: 

(1) Diagnostic and replacement policies 

(2) Logical design techniques for memory, control, and 
processing units so that diagnosis and repair are facil- 
itated (or indeed feasible) 

(3) Reliable commutation (or data switching) required for the 
execution of subsystem replacement 

(4) The specification of software for the control of diagnosis 
and repair. 

When the new sources of failure that are introduced by the mechaniza- 
tion of the four items are included in the reliability analysis, it is not 
clear that the systems will perform as promised. This is an especially 
intrical problem when reconfigurability is extended to low-system levels, 
or when the capability for graceful degradation is provided, 

D. Brief Summary of Report 

The succeeding chapters represent an attempt to study in depth many 
of the detailed problems which must be investigated before a design for 
a reconfigurable spaceborne computer vith graceful degradation can be 
specified, or indeed even before an intelligent estimate of the feasibility 
of designing such a system can be formulated. 

In Chapter II, the system organization of a multiprocessing structure 
is discussed along with the particular characteristics of such a structure 
which facilitate reconfiguration and graceful degradation and the unique 
logical design problems attendant to the achievement of an ultrareliable 
multiprocessor. Maintenance policies, which reflect the computation and 
design constraints defined in Sec. I-A, are discussed and a flow descrip- 
tion is given, which points out the system response to various error, 
input and interrupt conditions. 




In Chapter III, we summarize an initial effort to perform a detailed 
logical design of the various functions associated with the selected com- 
puter organization. It appears that the reliability will be enhanced if 
some capability for repair is incorporated into the memory, processing, 
and control units. Concerning the processing units and possibly a portion 
of the control and memory units, the most attractive repair scheme is one 
for which the logical realization is a one-dimensional cascade of identical 
elements, whence the repair operation requires only the routing to a suc- 
ceeding element in the cascade of the signals destined for a faulty element 
We call such a logical realization a byte-sliced realization since it is 
natural to assign a byte (containing at present an undetermined number of 
bits) of each of the registers, adders, decoders, etc. to each element or 
slice in the cascade. A design of a byte-sliced microprogram control unit 
is described, and a memory module, which embraces byte slicing in addition 
to such reliability enhancement techniques as data channel coding and 
access switch-failure detection is described. A novel digital realization 
is given, which relates to the voting-switchover scheme described in de- 
tail in Ref. 1. 

Chapter IV is concerned with the design of commutation networks for 
the various data switching functions associated with a byte-sliced multi- 
processor. In particular, we consider permutation networks that, for 
example, route data from a selected memory to a selected control unit, 
and order-preserving networks that route data between the working (i.e., 
failure-free) byte slices of memory and control units. An important 
feature incorporated in the designs is the capability to accommodate for 
hardware failures in the commutation networks without requiring the dis- 
carding of the entire memory and processing units or even the byte slices 
served by the network. 

Chapter V is concerned with what reems to be an initial attempt to 
specify formal rules for the synthesis of programs in an ultrareliable 
computer. The description is in a general framework so as to embrace 
two important problems. The first of these is the synthesis of programs 
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which are mistake-free, wherein a mistake is assumed to be the result of 
human frailty, or of programs for which mistakes are readily detected and 
recovery to correct execution is possible. The second is the synthesis 
of programs that will facilitate the detection of hardware failures. We 
also consider the interrelationship between hardware and software, in 
particular, the ancillary hardware that will facilitate the specification 
of reliable software. 
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XI PRINCIPLES OF MULTIPROCESSOR SYSTEM DESIGN 


A. Introduction 

In Sec. I-A a brief list was presented of the characteristics of 
advanced spaceborne computation and the constraints on a computer design 
that are a consequence of these characteristics. It is observed — although 
clearly the observation is not original — that the multiprocessor organiza- 
tion provides an excellent match to the severe spaceborne computer design 
constraints. The basis for this conclusion is that in a multiprocessor 
structure: 

(1) The facility exists for simultaneous execution of several 
programs 

(2) It is possible to satisfy a wide range of computation 
time requirements by assigning a varying number of 
processors to a given task 

(3) It is possible to satisfy a wide range of accuracy ** 
requirements for the same reason as in (2) 

(4) Reconfiguration, to enhance reliability, is easily 
accommodated, at least at the processor level 

(5) Accommodation to different urgency levels is easily 
achieved by assigning a varying number of processors 
to a given task, although in this case each processor 
will operate with the sair.e set of data' 

(6) The data switching does not appear to be significantly 
more complex than would be expected for a single 
processor structure with extensive reconfiguration 
capability. 

Reference 1 concluded that the key problems of system design in 
achieving a reliable reconfigurable computer are flexibility of struc- 
ture, modularity, simplicity of diagnosis, and reliability of control. 


,r These terms were informally defined in Sec. I-A. 

^ It is implied that a set; of processors are operating in the replicated 
mode, ana the data outputs are in some way compared. 
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The first two problems, at least at the processor level, are clearly 
solved by the multiprocessor organization; the solution to the latter 
two problems must await the detailed study of maintenance policies and 
of the logical design of the processors. 

Many descriptions of multiprocessor systems have appeared in the 
literature , s * 9 * 10 and several contemporary computer systems 11 rely upon 
multiprocessing. Most of these previous descriptions have been concerned 
with (l) gross estimates of system reliability, assuming for example, 
that diagnosis and switchover are always executed correctly; (2) schedul- 
ing analyses and simulations to facilitate the determination of system 
responses to various inputs; and (3) the specification of software that 
will enable the optimum utilization of the hardware. Our intention in 
this program is to study maintenance policies and logical designs that 
will maximize the system reliability,^ and then to formulate a realistic 
estimate of system reliability, which will, hopefully, compare favorably 
with the previous gross estimates. 

In this chapter we discuss a possible multiprocessing system organiza- 
tion, considering those policies which apply particularly to error control 
and the functional requirements of the various blocks. 

B. Multiprocessor System Organization 

The multiprocessor model with which we are concerned, is depicted in 
Fig, II-l. This system consists of a set of M high-speed working memories 
(WM); a set of N simple processor and control units (SP); a set of Q 
arithmetic logic units (ALU); a set of H back-up memories (BAM); an input- 
output device controller (I/O) for which we provide sets of spare registers, 
counters, buffers, and real-time clocks; two corn-nutation networks (CN); a 


At present the "reliability of a multiprocessor" has not been formally 
defined, but we will temporarily assume on a qualitative basis that 
the reliability measure reflects the capability of the unit to carry 
out the set of mission computation tasks weighted according to priority, 
urgency, and accuracy. 
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FIG. 11-1 MULTIPROCESSOR COMPUTER SYSTEM BLOCK DESIGN 














supervisory control unit (SCU); and two registers for setting up the 
commutation networks (it is convenient to view these registers as form- 
ing a component of the SCU). 

In operation a given set of WM f s, SP*s, and ALU f s will be in com- 
munication by means of links established in the two commutation networks. 
In addition, the commutation networks* perform the function of directing 
data to the failure-free byte slices, thus, effectively "repairing” the 
units. 

It is envisioned that each SP unit will have the capability of 
executing comparatively simple decision and arithmetic algorithms, the 
capability of controlling program flow, and the capability of controlling 
processor allocation and scheduling.*^ An ALU will be used for the execu- 
tion of complex algorithms requiring extensive processing hardware. The 
SCU will function as a referee in all error-control processes, and, 
essentially, represents the system hard-core , although it can be super- 
ceded by a command from the ground. The BAM’s will store the task pro- 
grams, diagnostic programs, and the setup programs for the commutation 
networks. 

In addition to the inter-unit communication links provided by the 
commutation networks, a single data channel (probably serial), shown as 
bold-faced lines in Fig. II-l, is provided. This data channel links the 
SP and WM units with the supervisory control unit. It was noted by 
Alonso 10 that a complete multiprocessing system could be designed, con- 
taining only this single data-channel communication link, although in 
practice it is doubtful that this link would be serial. However, we have 
included the possibility of multiple-simultaneous communication between 
SP and WM blocks because of the additional flexibility thus provided, and 
because it seems appropriate, at this stage, to work with a general model. 


■ 8 * 

Detailed logical designs of commutation networks are presented in 
Chapter IV. 

«#• 

It is implied here that each processor can function as an executive. 
This "floating-executive" technique, which is developed with greatei 
detail in Sec. II-c, is also discussed in Ref. 12. 
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One important feature of the system, which is not shown explicitly 
in the figure, is that each defined block of the system will have at 

ft 

leas t one distinct power supply associated with it. Furthermore, it is 
assumed that the power can be disconnected from a faulty block without 
resulting in the propagation of errors into connecting blocks due to 
excessive loading on the part of the disconnected unit, 

C. Description of Error-Control Policies 

In this section we will describe a possible set of error-control 
policies by indicating in flow-form the system response to various input 
and error conditions. It should be recognized that the system will not 
function exactly according to this description, but, by reference to this 
description, we are provided with a reasonably complete set of functional 
requirements for the various blocks. 

In Table II-l we present the flow description, and in Tables I 1-2, 
and II-3 respectively, we summarize the functional requirements of the 
supervisory control unit and the simple processor and control unit per- 
forming the role of the executive. At the conclusion of Table 11*1, a 
set of comments are presented to clarify anomalies in the flow descrip- 
tion and to point out some instances where there is ^ome doubt concerning 
the optimality of the strategy selected. 



It was shown in Ref. 1 that the distribution of power supplies does 
not appreciably increase the weight or raw power requirements beyond 
that associated with the use of a single power supply. 
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Table II-l 

FLOW DESCRIPTION OF MULTIPROCESSOR 


1. SCU selects an SP (say SP,,), which is listed in the table of 

£ 

available SP T s, to function as executive. 

2. SCU instructs SP„ to access an available BAM. 

E 

2.1 If accessing capability of SP has failed, SCU selects 

E 

another executive, and SP is removed from the table of 

E 

available SP’s, and Step 2 is repeated. 

3. SP £ selects an available WM to serve as the executive working 


memory, WM^ 


4. SP retrieves from a small store, possibly associated with the 

E 

SCU, the setup data so that SP £ can access WM^,. 

5. Setup data is transferred to the setup register of CN^. 

6. A WM and SP diagnostic program is directed from BAM through 

SP E t0 WM E* 

6.1 If the diagnostic program indicates SP cannot function as 
an executive, another SP is chosen and Step 2 is repeated. 

6.2 If diagnosis of WM^ fails and WMg cannot be repaired, 
another executive memory is selected, and WM^ is removed 
from the available table. 

6.3 If WMg is repairable, SP E performs the task. 

7. The executive program is transferred from BAM through SP 

E 

to WM . 

E 

8. SP supervises diagnosis of the remaining set of SP’s. 

E 

9. SP selects a set of WM’s, SP*s and ALU’s to communicate with 

E 

each other. 

10. SP computes the setup data to effect this communication and 

E 

transfers setup data to the two setup registers. 

11. Each SP diagnoses its associated ALU and WM. 

11.1 If diagnosis fails and associated WM's and ALU's are not 
repairable, a different assignment is effected. 
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Table 11-1 (Continued) 


11.2 If the WM's and ALU's are repairable, the associated SP 
performs the task, 

12, SP responds to an input device requesting service. 

£ 

12.1 Assume urgency value of input is 1 (simplex): 

12 . 1.1 
12.1.2 


SP_ selects SP , 
E s t 


WM , ALU to handle program. 

s i s i 

Pertinent program is transferred from BAM through 


CP to WM . 
s x Si 

12.1.3 Input data is transferred to WM and also to WM . 

Si £ 

12.1.3.1 Results of intermediate calculations 

might be stored in WM (as well as in 

£ 

WM ) to provide a convenient roll-back 
si 

point in the event of a failure. 

12.1.4 Assume SP detects a computational error: 

si 

12.1.4.1 The computation is repeated, using the 
data and results stored in WM^, to de- 
termine if failure is transient or 
permanent . 

12.1.4.2 If error continues, the problem, data, 
and intermediate results are transferred 
to another set of SP, WM and ALU units 
for continuation of computation. 

12.1.4.2.1 SP is diagnosed and either 

Si 

retained or discarded. 

12.1.4.2.2 WM and ALU are either 

Si s x 

diagnosed by SP g (if it does 
not contain the failure) or 
retained for other tasks. 

12.1.4.3 If computation error continues with the 
replacement set of SP, WM, and ALU units, 
the error resides in the program or 
input data. 
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Table II-l (Continued) 


12.2 Assume urgency value of input is 2 (duplex): 

12.2.1 SP E selects the set of units (SP D , WMp , ALUp ) 
and (SPp , WNp , WM^ , ALU^ ) to handle program. 

12.2.2 Pertinent program and input data are transferred 
from BAM through SPp , SP^ to WMp , WMp . 

12.2.3 After each intermediate computation is completed, 

a change-commutation network command is directed 

to the SP . 

£ 

12.2.4 The assignment of CN 2 is altered so that SP Q 

and SP exchange working memories. 
i>2 

12.2.5 The results computed by the two sets of units are 
compared. 

12.2.5.1 The intermediate computation is repeated, 
commencing at the latest point in the 
program where the computations were in 
agreement . 

12.2.5.2 If the discrepancy ceases, then the fail- 
ure was transient. 

12.2.5.3 If the discrepancy continues, then the 
failure is immediately attributed to one 
of the sets of units (SP , WM n , ALU_ ), 

D i "i 

< s v *V "V- 

12.2.5.4 The faulty set is disconnected, and CN 2 

is altered by SP so that an available 
£ 

set of blocks are connected. 

12.2.5.5 The faulty units are then diagnosed. 

12.3 Assume urgency value of input is 3 (triplicated): 

12.3.1 SP E selects three sets of units (SP T , WM^ , ALU^ ) 
(SP T2> WM^, ALU^) and (SP^, WM /aLI^ ) to 
handle program. 

12.3.2 Pertinent program and input data are transferred 
to the three working memories. 
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Table II-l (Continued) 


12.3.3 After each intermediate computation is completed, 
the results obtained by each of the three sets of 
units are compared. 

12.3.4 If a unit’s results disagree with those computed by 
the remaining two units, the dissenting set is dis- 
connected and replaced by another set. 

Comments 

1. It is assumed throughout that the decisions of the executive 
are continuously checked by the SCU. For example, an exec- 
utive that consistently requests the diagnosis of the other 
SP f s would be disconnected from service and subsequently 
diagnosed. 

2. The status of the SCU can be periodically monitored by, for 
example, a ground station, whence the SCU can be discon- 
nected from service if it appears to be faulty. The ground 
station can then assume the ’Veto" power previously assigned 
to the SCU. 

3. It has not been assumed for a general system that only a 
single unit can fail during the period between the execution 
of diagnostic routines, although a system which contained 
only two SP units would probably be disabled by the occur- 
rence of simultaneous faults in each SP. 

4. In the description, it is implied that all units are diag- 
nosed immediately prior to insertion in the system. Although 
this policy would make a single-failure assumption seem more 
tenable, it is not strictly required since the system can 
accommodate a policy wherein diagnosis is deferred until a 
failure is detected, 

5. In Step 9 it is assumed that a sufficient quantity of WM’s, 
SP’s and ALU’s are available. If this is not the case, then 
several tasks might be executed with the same equipment in a 
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Table II-l (Concluded) 


multiprogrammed mode, or an SP unit might execute a program 
without recourse to an ALU. In addition, in order to satisfy 
stringent accuracy requirements, several ALU’s might be 
assigned to function with a single SP in the execution of a 
program. 

6, We have assumed that each task could be assigned one of three 
urgency values, corresponding to simplex , duplex and triplicated 
modes of operation. The simplex mode, wherein a processor de- 
tects its own faults, would be attractive for tasks that could 
be interrupted and for which the data and program code permit 
convenient roll-back to a known error-free state. The duplex 
mode, wherein two sets of units function simultaneously, would 
be used for interruptable tasks that require immediate error 
detection, but for which convenient roll-back is not possible, 
and for which an accurate record of all calculations must be 
available. The triplicated mode, wherein three (or possibly 
more) sets of units function simultaneously, would be used for 
the most critical mission tasks, 

7, The reliability status of the various SP, WM, and ALU blocks 
is stored in a table of available equipment, which can be 
conveniently considered to form a portion of the SCU. A fail- 
ure of this table could be circumvented by assuming that all 
units are faulty and whence each unit is diagnosed prior to 
insertion in the system. 
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Table 11-2 

FUNCTIONAL REQUIREMENTS OF SUPERVISORY CONTROL UNIT 


1. Store a table of available equipment. 

2. Select an SP to function as the executive, and direct 
the chosen SP to access a diagnostic program. 

3. Control the diagnosis of an SP that will function as 
the executive. 

4. Monitor all directives of the executive, but the SCU 
can only "veto" the commands of the executive. 

5. Respond to earth commands to disconnect itself from 
service. 


Table II-3 

FUNCTIONAL REQUIREMENTS OF EXECUTIVE 

1. Compute the set up data for the commutation networks. 

2. Respond to all input and interrupt commands. 

3. Select sets of SP's, WM’s and ALU's for various tasks. 

4. Organize the diagnosis and repair of units to be in- 
serted in system. 

5. Respond to the error and program status conditions of 
the other SP units. 
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D. Logical Design, Strategy, and Software Problems 
A ssociated with Multiprocessor-System Design 

As indicated in Table II-l , many detailed problems of design and 
analysis are critical to the functioning of the multiprocessor. Among 
these are the following: 

(1) A study of the simultaneous data transfer capacity required 
for the commutation networks. Of concern here are the 
trade-offs between rate of program execution and the com- 
plexity of both the commutation network and the setup 
algorithm. 

(2) The design of commutation networks that offer economy of 
design, ease of setup, ease of diagnosis, and a tolerance 
to failures in the sense that a failure in a portion of 
the commutation network should disable a minimum amount of 
processor and memory capability. (See Chapter IV) 

(3) The design of memory and arithmetic logic modules that 
offer combinations of fault masking and ease of reconfigura- 
tion. A discussion of such ALU’s is presented in Ref. 1, 
and the techniques of microprogram control for such units 

is discussed in Sec. III-C of this report. A description 
of the organization of a reliable memory module is given in 
Sec. III-B of this report. 

(4) The design of simple processor and control modules. It is 
envisioned that these modules would be of minimal complexity 
and would possess the capability of either controlling the 
flow of program data between a WM and an ALU, or perform 
the role of both controller and processor upon the occur- 
rence of an insufficient quantity of available ALU units. 

A convenient framework for such a module is contained in a 
paper by Frankel, 13 which discusses the minimum complexity 
required for a digital computer. Some reliability enhance- 
ment might be incorporated within the module, either in the 
form of reconfiguration capability at the register and 
adder-byte level, or in the form of fault masking for ir- 
regularly structured control functions. 

(5) The incorporation of protection for the back-up memories. 

If these memories are in the form of tapes, reasonably 
simple error-correction coding techniques can be applied 
for the protection of stored data. 


20 


(6) The synthesis of programs that are amenable to simplex 
type error detection, and the development of techniques 
which permit the detection of, and the recovery from, 
errors introduced by programming mistakes. (See 
Chapter V. ) 

(7) A reliability analysis of the overall multiprocessor system 
and an evaluation of the assumed maintenance policies in 
order to uncover a possible "optimum" set of strategies. 
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III TECHNIQUES OP LOGICAL DESIGN 

A. Introduction 

This portion of the report is concerned with schemes for the logical 
design of networks that realize major functions appropriate to a recon- 
figurable, reliable computer. The functions to which we have given the 
greatest attention thus far in the second phase of the study are memory, 
microprogram control, and adaptive voting. The important case of a a 
arithmetic- logical processor was examined in the first phase. The approach 
taken was to organize a processor as an iterative array of byte-oriented 
modules, called byte-slices. This approach, which appears to be quite 
powerful, is applicable to a number of special networks (e.g., in micro- 
program control units). Further development of processor networks will 
be examined in the coming period. 

B. The Organization of a Reliable Memory Module 

1 . Introduction 

Hie working memory of a computer comprises a natural organizational 
unit for the application of error control. It is a vital unit, accounting 
for a large fraction of the equipment of a modern computer, and its sim- 
plicity permits a wide range of error-control approaches; hence, it is of 
interest to develop effective and flexible design techniques appropriate 
for memory systems. 

In Appendix A of the Final Report — Phase I, 1 a number of techniques 
for error control in conventional working memories were reviewed. As a 
result, the following observations were made. First, it is generally 
impractical tc apply fault masking at the circuit level to the circuits 
Involved in memory selection and storage; hence, error-contrul techniques 
must be applied at the logical level. Secondly, there are several signif- 
icant sources of error, including data channels, word selection (access) 
circuits, basic timing and power circuits, and the logical schemes most 
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effective for error control for each source are quite different from one 
another. Furthermore, both transient and permanent errors are significant. 

Iu this section we will describe the organization of a memory module 
in which several kinds of error-control schemes, primarily logical, are 
incorporated. Only the basic schemes are presented. Further analysis 
is needed to determine the optimum allocation of redundancy in the several 
schemes. 

We assume a model that is a magnetic core memory. Thus, we include 
a power- decoding- switch (access switch), for word selection, a passive, 
destructive- read, recording medium, a set of power drivers for recording, 
a set of sense amplifiers, and associated timing and power-supply circuits. 
The techniques to be described allow for other kinds of memory, such as 
nondestructive- read, or active-element storage, or even some cyclic mem- 
ories, but we shall not attempt to provide full generality. 

Experience has indicated the need to consider the following error 
types as very significant: 

(1) Transient errors, primarily in the data- read channels, due 
to external noise or to internal, data-sensitive signal 
cross- coupling 

(2) Permanent, single-element, independent component failures 

(3) Permanent, multiple-element, nonindependent component 
failures (e.g., a cracked integrated circuit device) 

For the present we ignore the timing and power- supply circuits, since 
their design problems do not seem to be unique to memory systems. 

2. System Description 

The memory system, which we consider, is described with reference 
to Fig. III-l. It is a memory system of W words, each containing B bytes 
of data, where a byte contains D binary digits. In this system, the 
following redundancy schemes are incorporated. 

(l) The system is partitioned into B subsystems, each serving 
all W words. Each subsystem is an independent memory, 
containing its own access switch and drivers, store and 
data sense and drive amplifiers (in a complete design, 
separate subsystem power supplies and timing circuits 
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would also be appropriate). B is greater than the number 
of bytes needed for a minimally acceptable word, B^j hence, 
an order- preserving commutator network is provided to 
connect any subset of Bjj out of the B bytes to the memory 
system interface. 

(2) Incorporated in each store is a special circuit for deter- 
mining the number of words selected at a given cycle, 
quantized to the levels 0, 1, and more than 1. Reliable 
circuits for this determination are well known. For 
example, one may provide one or two storage elements 
(cores) per word, operated so as to switch on a standard 
word selection excitation, with a common three- level sense 
amplifier. This circuit is very useful in determining 
whether or not the access circuits are faulty, since by 
far the most common modes of failures within access 
switches result in either no output or in multiple word 
selection. 

This circuit is not absolutely essential since the cases 
of zero or multiple word selections could be inferred 
from analysis of the data channels. For example, if 
multiple parity-check redundancy is used (it is recom- 
mended in the next paragraph) , the code could be designed 
to have a large error- detecting capability, and most 
multiple selections would be detected. However, the 
redundancy required for such error detection would be 
better used for error correction. 

(3) The data of each subsystem are encoded with an error- 
correcting code. The most attractive codes for this 
purpose are those based on threshold decoding (see 
Sec. II, part B-2a of Final Report — Phase I 1 ) because 
most single faults within the decoder network for such 
codes are masked, and the codes are reasonably efficient. 
Furthermore, as discussed in the Final Report--Phase I, 
double-error correcting codes are far superior to single- 
error correcting codes with respect to reliability-weight 
tradeoffs. The disadvantage of the former lies in the 
doubled cost (per data bit) of the decoding network, but 
with the new LSI technology, this is not a serious 
disadvantage. 

An important consideration here is the problem of multiple 
transient errors due to environmental noise. The problem 
that arises is that if there are more than t + 1 errors 
in a word in a 2t + 1 error- correcting system, the 
resulting pattern may appear (falsely) as a valid symbol, 
or as a symbol with i or fewer bits different from a 
valid symbol. As discussed in the Appendix, 1 this 
problem is reasonably well solved by using the error- 
detecting capability of the code; thus, the error- 
correction logic should provide an alarm when more than 
£ bit errors are computed in a 2t + 1 error- correcting code. 
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(4) An associative memory is employed to provide relocation 

for words that become unusable. The primary failure source 
for which this scheme is intended is the access switch 
and its drivers, although it is also useful for the case 
in which only a small number of storage locations suffer 
more bit damage than can be accommodated by the error- 
correcting codes on the data channels. In the scheme, 
a block of words, say the last 2 a words in a 2 c -word 
memory, are reserved for relocated data. An associative 
memory with capacity for 2 a entries is driven in parallel 
with the main memory subsystems. If one of its entries 
is excited, indicating that the external computer is 
addressing a word that has been relocated, a substitute 
location number, or alias , is emitted. For economy of 
storage, only ja bits are needed for the alias since the 
high order digits may be provided by a constant (c - a)-bit 
number source; thus, the associative memory contains 2 a 
words, of c + a bits each. The high- and low-order digits 
are combined, and the resulting number replaces the 
original address in the memory access register. 

3 , Coordination of Error-Control Mo des 

Mode 1, partitioning with switch-over replacement, would be usei 
when the error-correcting capability of all lower-level schemes is ex- 
ceeded, It requires external diagnosis and control. Mode 2, detection 
of over- or under-selection, is important in preventing the data errors, 
due to access switch failures, being falsely corrected by the error- 
correcting logic in the data channels. Mode 3, error correction in the 
data channels, is the primary means of masking transient error. Mode 4, 
word-relocation, is important for accommodation of access-switch faults. 

It appears to be a valuable scheme, because, in commonly used switches, 
the average single fault results in the loss of only a small fraction of 
the outputs. The value of the scheme would be increased further, if the 
switch were designed so hhat the maximum fraction of outputs lost, due 
to any single switch fault, was limited. 

A reliability analysis of the total system is needed in order to 
determine the optimum allocation of redundancy among the various error- 
control modes. 
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C. Design Techniques for a Modular, Microprogrammed Control Unit 


1. Introduction 

The use of a microprogram (^program) structure for the control 
unit of a reliable computer is attractive because it simplifies the task 
of modifying the behavior of the con unit in order to accommodate 
system failures. In addition, the inhc ~*t modularity of the major part 
of the structure potentially allows the use of highly efficient forms of 
error control within the control unit itself. 

In this section we shall examine techniques for improving existing 
M'-program schemes in two ways that are significant to reliable computers; 
these are, increasing structural modularity, and increasing the effective- 
ness of reprogramming for system reconfiguration* The important problem 
of applying error control to the new schemes will be considered in sub- 
sequent work. 

In a M-program control unit, a control algorithm is represented by 
data called M-instructions. These data are usually treated as words 
stored in an addressable memory. In practice, this memory may be realized 
as a matrix of logic elements (e.g., diodes), and frequently logic opera- 
tions other than simple memory functions are employed within the matrix. 

In this study we assume that only memory functions are allowed. This 
assumption not only increases the generality of the results, but it per- 
mits the use of the main working memory of a computer as a back up for 
the fi-program data store. 

The basic components of a ji- instruction are as follows: 

(1) Specification of the active system data paths 

(2) Specification of the operating modes of functional units 

(3) Specification of the rules for selection of the next 

jjb-instruction. 

The third component is the source of most of the nonmodularity of struc- 
ture. In the next section we discuss various kinds of next- instruction 
rules, and means for increasing modularity of implementation. 
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2. Schemes for Selection of the Next Auction 




There are two aspects to the selection of the next ^-instruction. ] 

These two aspects, the composition of the address code and the generation ] 

I 

of test functions for branching- type ^instructions, will be discussed | 

separately, | 

1 

a. Composition of the Address Code 

The address code may be composed, using data stored explicitly 
in the ^instruction, or data stored in temporary data registers on coun- 
ters, which may be considered to be implicit in the ^instruction, or by 
some combination of explicit and implicit data. The explicit form has 
the advantage of speed, since i.o time is needed for loading special reg- 
isters, and the disadvantage of storage cost. There is a wide variety 
of useful schemes, employing combinations of implicit or explicit 

references , j 

It is convenient to distinguish four types of ^instructions, j 

nonbranching, unconditional branching, two-way branching, and multi-way 

I 

branching. j 

I 

The explicit form is feasible and commonly used for the first j 

i 

three of the four types. For two-way branching, providing memory space | 

in each ^-instruction for two explicit next- location fields may be very i 

extravagant if only a small fraction of the p- instructions are of this 
kind. 

The implicit form is appropriate for the first type, non- 
branching, ^instructions; the natural rule is to take the new location 
as the previous location plus one. The second type, unconditional 
branching, M-instructions require explicit data. In simple control units, 
the fraction of unconditional branching type ^-instructions may be low 
enough to justify placing the next- address data in a separate memory word 
so that a given word holds either external control information or next 
M-instruction information. 
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For the third typ^, two way branching, ^-instructions, one of 
the next-address codes be implicit (present address plus one). 

Another important scheme is to use two implicit addresses in a "skin" 
manner, as follows: 

first branch: next address = present address plus 2 

second branch; next address = present address plus 1. 

In the second case, the next ^instruction clearly must be an unconditional 
branch, the second type. Th . ’’present address” may also be explicit. 

For the fourth type, multi-way branching, ^-instructions , the 
’’skip” sequencing scheme may be generalized as follows: 

next address = present address plus N, 
where N is generated as a result of the test. 

All but one of the next ^instructions must, again, clearly be uncondi- 
tional branches. Again, the ’’present address” may be explicit. 

The foregoing descriptions illustrate the wide range of choices 
that are present in the design of the address-composition function. 
Inherently, all of them are consistent with a modular structure, since 
they require only the operations of register transfer, counting and 
addition. The particular choices require detailed examination of engi- 
neering trade-offs of speed and equipment cost within the context of a 
particular system. 

b. Generation of Test Functions for Branching- Type p-Instructions 

Let (x , x , . . . , x ) be the set of system status indicators 
i « n 

(arithmetic overflow, input/output requests, program symbols, etc.) 

There are several functions of these indicator variables that 

are of practical importance in the testing of system status. For the case 

of a binary test, i.e., for a two-way conditional-branching (X-instruction, 

the test function is a single variable, say f. This function, f, may be 

a general boolean function, but some functions are of special practical 

importance. These are f = x and f = x. x. . , . x . , where x. may b^ the 

1 1 i x b 1 

true or complement form of a system status variable. Since f may be 
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tested to be true or false (and the x's may be complemented), the second 
form is equivalent to the form f = + x^ + ... + x^. 

3. Schemes for Programmable Selections 

In order to expand the power of the branch instruction, it is desir- 
able to generate f under program control. 14 This also has the obvious 
advantage of enhancing reconfigurability for error control. This may be 
accomplished by providing a function-selection vector, say 
T = (t , t , ..., t ), as a subfield in a ^instruction word, and a 

*•— 1 S 

logic network with inputs T and X = (x^, x^), and output f. 

A convenient network for this purpose may be built, ULing a decoder 
with n £ 2 s outputs and a simple AND/OR network as shown in Fig. III-2. 

An extension of this scheme, which provides for a programmable reassign- 
ment of selection code, is shown in Fig. III-3. A E stage acts as a +1 
adder, such that output =1^+1 (arithmetic sum) if r^ = 1, or 

= T^, if r^ = 0. An arithmetic overflow, is produced if = 2 s . 
Furthermore, only one y will be 1. The cascade of U stages serves to 
select the x corresponding to the uniquely energized y, and to deliver 
it to the output f. The logic for a U stage is u i+1 = u + y i X i • The 
r variables serve to program the selection, since r^ - 0 causes stage i 
to be bypassed. This way of reassigning T codes to X variables permits 
reconfiguration without changing the T data in the ^program memory. 

This may be advantageous, since it permits use of a read-only memory. 
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FIG. 1 1 1- 3 RECON FI GURABLE SELECTION NETWORK 


4. Schemes for Programmable Generation of Boolean Implicants 

A high degree of modularity and programmability in the generation 
of boolean implicants may be achieved by using a certain functional unit 
that appears to have great utility it: modular computers. As shown in 
Fig. III-4, this unit consists of a single binary parallel adder and a 
scratchpad memory. (Such memories will be widely available in LSI form, 
and, in fact, the functional unit itself is a good candidate for LSI.) 



FIG. 111-4 PROGRAMMABLE BOOLEAN-IMPLICANT FUNCTION NETWORK 
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In this application we employ the following carry function for 
ea' *i state: 


= ( a i. + b i^ c i-l 


instead of the usual arithmetic function, 


= (a 4 t b i jc^_ 


1 * a i b i 


If a standard arithmetic unit is used, the term a^^ must be suppressed 
under separate control . 

Let the set of system status variables be X = (x g , x g _^, •••. x 2 » Xj). 

We wish to evaluate boolean implicants, using any specified subset of the 
input variables. For example, we may wish to determine if XgX,^ = 1. 

We apply both the variables and their complements to one side of the adder, 
e.g. , in the order (x g , x g , x gl , x g , ... x' 2> x^ x'), with x' at the 

least significant binary input, and we apply to the other side of the 
adder a "mask" vector, obtained from the memory in response to an externally- 
specified address code. 

Let the elements of the stored mask vectors be M = (p , q , p , , 

S S S“*i ^ 

... P 2 , q 2 , p , q x ), thus elements p^ correspond to the inputs x^ 
respectively. If x^ appears in the implicant in true form, set (Pj^) 
to (Ol); if it appears in complement form set (PjQj) to (10), and other- 
wise, set to (11)* 

til 

To carry function at the 2i stage is 

C 2i = (p i + X i )(q i + X i )c 2i-2 ’ 


thus, depending on the values of p^ and q^, we have 


'2i 


“ X i C 2i-2 ’ X i C 2i-2 ' ° r 1#C 2i-2 
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If the addition is carried out with c Q set to 1, the final carry 
c 2a will be 1 if and only if the implicant is true. 

As an example, let s = 5, and let the desired implicant be x 5 x'x 2 . 

The appropriate mask vector is M a (01, 11, 10, 01, ll). The boolean 
vector sum M + X is (l, 1, ..., l) (necessary and sufficient for c 2g = 1) 
if and only if X 5 X 3 X 2 = 1, 

A similar operation may be used to obtain the boolean sum of an 

arbitrary set of variables, each complemented or not. For example, in 

order to determine if (x. + x„ + x!) a 1, a test is made to determine if 

v 4 3 1' 

X 4 X 3 X 1 = °* This is done by creating a mask for x^x^x^ as described 
previously, and observing if the final carry c 2g is 0 (rather than *). 

More complex expressions may be realized using a single flip-flop 
to store the results of a succession of impllcants. If the flip-flop is 
initially set, and if a zero carry resets the flip-flop, a given sum of 
products will be true if and only if the flip-flop remains set following 
the last product test, 

5. Hierarchy Schemes 

Hie simplest structure for a m~ program unit provides for the direct 
application of the binary elements of the p-instruction word to the ex- 
ternal control lines. This structure is used in practice, but there are 
some advantages to be derived from the use of a hierarchical structure. 

One form, proposed by Grasselli , 15 is analogous to the so-called 
interpretiv e programming systems. In this form, illustrated in Fig. III-5, 
the number of different instructions is limited, and a compact code is 
established for their representation (each word is called a p-order) . 

A (x- program is recorded in a first memory as a string of p-orders. A 
"dictionary” of ^instructions is held in a separate memory, and, in 
operation, a selected p-order serves to address that memory. Hie merits 
of this scheme are several. First, most of the data storage, i.e., the 
dictionary, may be held in a fixed memory, while still preserving some 
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FIG. MI-5 TWO-LEVEL MICROPROGRAM SCHEME (after Gro tell i) 

flexibility in tne ^program sequence. Secondly, since the borders are 
in a compact code, several may be packed into a single word of a relatively 
slow memory (e.g,, the main working memory) without unduly slowing the 
control sequences. 


Another form to which the authors know of no prior reference of 
special value in error control is to provide some kind of modifiable 
translation on the components of the p-instructlon word. The purpose of 
this translation would be to change the assignment of activation signals 
to the external functional units, thus effectively reconfiguring the 
system, TTiis translation would be effected in a special network (a useful 
form for such a network would be an associative memory) . This scheme also 
has the merit of allowing the main memory to use a fixed data store. 
Another advantage is that for high degrees of redundancy, the capacity 
of the p-instruction store is minimized, since only the nonredundant 
control codes need be recorded. Figure III-6 illustrates the application 
of this method. 

Both hierarchy schemes have attractive features, but further inves- 
tigation is needed to determine the costs in speed and amount of logic 
needed for their Implementation, 
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FIG. 111-6 FIXED PROGRAM-VARIABLE TRANSLATION MICROPROGRAM SCHEME 
6. Conclusions 

Various alternatives in the design of ^program type control units 
have been reviewed. A number of schemes have been developed for real- 
izing major functions in modular and programmable forms and for permit- 
ting a substandard use of fixed-data type memory stores* However, the 
merits and costs of the various schemes need to be evaluated* It is also 
necessary to investigate appropriate means for applying logical redundancy 
for error control within the control unit. 

D. An Improved Realization for Switched- Adaptive Voting 
1 . Introduction 

In the von Neumann voting scheme, up to m errors in the outputs of 
2m + 1 nominally identical binary channels may be corrected by obtaining 
the system output as the majority function of the channel outputs. Pierce 
demonstrated the value of combining the outputs by a function that weights 
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the contribution of each channel according to its recent error rate. A 
continuous weighting function is discussed on page 371 of the Phase I, 

Final Report. As a special case, it may be seen that it is beneficial 
to completely disconnect a permanently- failed channel. 

Several logical realizations of the latter scheme are also described 
in Sec. IIA, 2b of the Final Report — Phase I* The simplest structure was 
obtained by use of a linear- input logic element in which the 0 and 1 states 
of the inputs were encoded a&> +1 and -1 signals with the output 1 if the 
sum of the inputs is at least +1 Since linear input elements are diffi- 
cult to realize in microelectronic J^chnology, a realization was developed, 
using only binary-valued logic elements. This scheme required rather 
costly logic networks that effectively counted the number of active chan- 
nels and also the number of channels displaying a 1 value. These counts 
were needed because the scheme called for replacing the output of a dis- 
connected channel by 0; hence, the number of inputs f to the combining 
network) that constituted a majority depended upon the number of active 
channels. 

2. Description of a New Scheme 

The following approach permits the use of a fixed majority network 
as the entire combining element. Let only one channel be disconnected 
at a time, then, as successive channels fail, replace their outputs by 
constant 0 or 1 signals, alternately. 

It is convenient to represent the inputs to the majority network as 
C 0 . C , and x, representing, respectively, the constants, 0 and 1, and 
the free variable x. As an example, for the case 2n» + 1 - 7, the follow- 
ing useful sets of inputs appear at the input to the mrjority network: 

(7x), ( 6x, 1C Q ), (5x, 1C 0 , lCj), (4x, 2C Q , H^), (3x, 2C Q , 2^), and 
and (2x, 3C q> 2C^) . Since the system output is 1 when four 1. 's are 
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(2x, 3C q , 20^). Since the system output is 1^ when four l^s are input, 
the corresponding minimum ratios of free ^ inputs to to cal free inputs 
required for system output 1, are 4/7, 4/6, 3/5, 3/4, 2/3, and 2/2 
respectively. 

Die overall structure of a realization of this approach is given in 
Fig. III-7a. Each of N channels, , ..., x^, is presented to a Majority 
Net via an AND/OR cascade, whose other inputs are obtained from a control 
unit. Thus, for example, the first input to the Majority Net is the func- 
tion (x^)(c il ) + (C 12 ) . By proper choice of the control signals, and 
C 12' the * n P ut ma y P® set to the values x^, 0, or 1. Die Comparison and 
Control Unit determines when a channel is to be disconnected and also 
generates the appropriate control signals. Inputs to the unit are ob- 
tained from the channels and from the system data output. 


A logical realization of the Comparison and Control Unit is illus- 
trated in Fig. III-7b. Die subnetworks corresponding to channels are 
arranged vertically, between dashed lines. The control outputs for a 
given channel are obtained from two flip-flops; thus, for channel 1, 

C 11 = F^F^, ant * C l2 - F 12* Ma J° rit y Net in P uts x j> £ and i are 

produced by control outputs C^C^ = 1, C^C.^ = 1 » and C^ 2 = 1, respec- 
tively. Diese outputs, in turn, correspond to the flip-flop states 
F 11 F 12 = 1, F n F 12 = 1, and F^ 2 = 1, respectively. Initially, all flip- 
flops are cleared to the F state, thus, setting all Majority Net inputs 
to the x state. In order to set the channel 1 Majority Net input to the 
0 or the _1 state, either F^ or F 12 is set to the F state, depending on 
the parity of the number of previously disabled states. Diis parity is 
recorded by a trigger flip-flop, F , the state of which is sampled by 

p 

the signal g^. 


Signal g^, which is derived from signal f^ is gated by an output of 
the STEPPER unit. Signal f^ is true whenever the channel data, x^, and 
the system data output differ, the difference being taken as indicative 
of a channel error. The STEPPER unit serves to scan the channels serially; 
this is done to avoid ambiguity in the determination of the parity of 
active channels, in the case that several channels may fail at the same 
moment. The STEPPER is assumed to return to a rest state autonomously. 
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FIG. 111-7 SWITCH ED-ADAPTIVE VOTING 


The presence of a difference in any channel is indicated by signal 
y^ = fj + f 2 + ... + f N> which, when true, starts the STEPPER, if per- 
mitted by the external control signal TEST. For each channel, if a g 
signal is produced, a y g = g^ + g^ + . . . + g^ signal is produced, which 

triggers the parity flip-flop, F . The information in F is also avail- 

P P 

able in the set of F^ status flip-flops. 

In a system composed of a number of such voting stages, various 
portions of the control unit may be centralized. For example, a single 





STEPPER unit could serve all stages, and in the extreme, all the control 
logic could be programmed, with the exception of the status flip-flops. 

The minimum cost of the stage, assuming that, of the control unit, 
only the status flip-flops and their input gates are realized in hardware, 
is 2N flip-flops, 4N gates, and one majority network. The cost of the 
latter is about eight three-input gates for N = 5, and about eighteen 
three-input gates for N = 7. The total number is thus 10 flip-flops and 
28 gates for N = 5, or 14 flip-flops and 46 gates for N-7. For comparison 
(also excluding the comparison logic), the scheme described in the Phase 1 
report (pp. 74 and 85) requires about 5 flip-flops and 50 gates for N = 5, 
and 7 flip-flops and 95 gates for N = 7. 

Assuming that a flip-flop is equivalent to four gates, the equivalent 
gate costs for five- and seven-input stages are 68 and 102, for the new 
scheme, and 90 and 123, respectively, for the old scheme. The avail- 
ability cf a three-state storage element would allow further economies. 
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IV PRINCIPLED OF COMMUTATION NETWORK DESIGN 



A. Introduction 

1. Commutation Requirements 

In Chapter III, a multiprocessor system was proposed for which a 
set of simultaneous one-to-one connections could be established between 
working memory modules (WM) and simple processor and control modules (SP), 
and also between SP modules and arithmetic logic units (ALU) or back-up 
memories (BAft ) or input-output devices (l/O). Also discussed was the 
possibility of "repairing" an SP, WM or ALU module wherein the module is 
realized in a byte-sliced manner. In such a case, the repair operation 
requites the routing of the data, previously destined for a faulty slice, 
to a succeeding slice. The networks that perform the data switching, 
which is inherent in the operations of module assignment and repair, are 
called commutation networks . 

We have classified two types of data commutation for the module 
assignment, namely (l) complete permutation — complete utilization, 

(2) complete permutation — incomplete utilization, and three types of 
commutation for the repair operation, namely (3) incomplete permutation — 
order preserving, (4) incomplete permutation — nonorder preserving, and 
(5) "shorting." These five commutation functions are described schemati- 
cally in Fig. IV-1, where the specific applications of each function are 
also listed. 

The assignments associated with the complete permutation — complete 
utilization [CPCU(N)] function is probably obvious. The commutation net- 
work is to be capable of permuting, in an • rbitrary manner, a set of 
N input data lines, emerging, for example, from a set of memories to a 
set of N output lines incident to, for example, a set of SP units. In 
the illustration, a data transfer path represents a parallel set of 
lines containing many bits (24-56). The assignments associated with 
the complete permutation — incomplete utilization [cPIU(N,m)] function 
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FIG. IV-1 CLASSIFICATION OF DATA COMMUTATION REQUIREMENTS 








differ from those associated with the CPCU function in that for the former 
only a subset containing m inputs of the total of N inputs and outputs^ 
needs to be interconnected at a given time. For the incomplete permutation — 
order preserving [lPOP(r,m)] function a subset containing m inputs of the 
r inputs, say for example associated with the working byte slices of a 
simple processor and control unit, is to be connected to a subset of the 
outputs, say associated with the working byte slices of an arithmetic 
logic unit, but with the restriction that spatial ordering of the input 
signals is to be preserved at the output. The preservation of order is 
clearly required since the data to be commutated is a binary number. The 
assignments associated with the incomplete permutation — nonorder- 
preserving [lPNOP(r,m)] function differ from those of the IPOP case in 
that, for the former, preservation of order is not a requirement. For 
the shorting function the outputs of a given byte slice are either to be 
connected to the succeeding stage (slice) or "shorted" around that suc- 
ceeding slice. 

For any commutation function it is desirable to synthesize networks 
for which the following are true: 

(1) The design is economical 

(2) The network setup is not difficult 

(3) The data transfer is rapid 

(4) Failures in the commutation network do not disable 
either the commutation network or the modules served 
by the network. This tolerance to CN failures should 
be achieved with minimal increase of network com- 
plexity. 

(5) If the commutation network is "repairable," the diagnos- 
tic routines should be easy to specify and of minimal 
length. t 


* We are assuming that for all commutation requirements there are an equal 
number of input and output terminals; this is clearly not a necessary 
restriction and it is only postulated for convenience in description. 

+ The "length" of a diagnostic routine is defined to be the maximum number 
of sequences of input data required to diagnose the network. 
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. Prior Solutions to the Commutation-Network Design Problem 


The obvious solution to the commutation- network design problem relies 
upon the use of a single-level crossbar switch, similar to the type com- 
monly found in central telephone exchanges. Figure IV-2 represents, 
♦schematically, a crossbar switch serving a set of WM* s and SP's. Here 
the crossbar, where each single-pole single- throw switch represents a 
crosspoint, fulfills the requirements of both the CPCU(n) and IPOP(r,m). 


Clearly a crossbar with N r crosspoints would De sufficient, but it was 

2 2 2 

shown 1 that actually N (r - m ) crosspoints are sufficient. In any event, 


it is seen that 28*10 crosspoints are required to serve 25 (=N) processors 
and memories, 32 (=r) total bytes, and 24 (-m) bytes are required for com- 
putation. Although a multiprocessor of this complexity might appear un- 
reasonable, there is considerable motivation to seek more economical 
commutation network designs. 

Goldberg 1 > 16 described a serial transfer network for the IPOP func- 
tion which exhibited a complexity proportional to 2N, but, even with the 
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use of speed- independent logic, the data-transfer rate is probably not 
adequate for our application. 


3. Hie Primitive Building Block of Commutation Networks 


Most of the commutation networks described in following sections 
will be composed of interconnections of the "cell" shown in Pig. IV-3. 
The cell, which is very simple, behaves as a reversing switch, under the 
co: trol of a single memory flip-flop. 


In essence, it is a double-pole, double-throw reversing switch, 
controlled by a storage element (e.g., a flip-flop). In addition some 
means is provided by which the storage element is set to the desired 
state. Figure IV-3 (a) shows a relay-contact version, analogous to cir- 
cuits in the MOS technology, and Fig. IV-3(b) represents a NOR gate- 
realisation of the cell in question. The two mcdos consist of a crossing 
Fig. IV-3 (c) and a bending Fig. IV-3(d) of the pair of input leads to the 
pair of output leads. Figure IV-3(e) represents a redundant flip-flop 
version of the cell, for which any single component failure will result 
in one of two possible failure conditions. In the first failure condition, 
which we will call the "stuck function" condition, the cell can realize 
only one of the two possible modes, i.e., the bend or the cross. In the 
second such condition, which we shall call the "bad-output" condition, one 
output lead contains a faulty signal. Table IV-1 summarizes the failure 
conditions resulting from various component failures of the cell of 
Fig. IV-3(e). 



Table IV-1 

FAILURE CONDITIONS FOR BASIC CELL 


Component Fault 

Failure Condition 

Faulty OR gave 

Bad-Output 

Faulty 2- input AND gate 

Bad-Output 

Faulty 3-input AND gate 

Stuck- Function 

Flip-flop stuck in a mode 

Stuck-Function 

Same logic value on two outputs 
of a flip-flop 

Bad-Output 
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B. Commutation Networks for Complete Permutation — Complete Utilization 
1. Nonredundant Networks '* 

It is easy to see that a CPCU(n) network must contain enough two- 
state cells^ to specify all NJ possible permutations; thus, N^(n), the 
number of cells in the network, is given by: 

Nj(N) a log 2 (N1 ) 


or, asymptotically (from Stirling's formula), 

N, (N) S N log N - 1.443N +0.5 log„ N 
X 2 2 


| 

1 



m 


In Fig. IV-4(a) we depict a CPCU network. The subnetworks ? A and P Q are 
themselves complete interconnection networks, each with half of the number 
of Inputs of the total network. This arrangement requires a number, N^N), 
of cells which satisfies: 

N (N) = N 1 [(N/2)] + HjtM - (N/ 2)3 + N - 1 

Solution of this recursion for N = 2 r , [using N^(2) = l], gives 

N 1 (2 r ) = 2 r (r - 1) +1 , 

or 




ft 



N X (N) 


N log 2 (N) - N + 1 


§ 


k 

Ihe nonredundant networks for the case N » 2 described below were dis- 
covered under another contract, and are described in two forthcoming 
papers. 17 ! 18 All of the other resultr were obtained under this present 
contract, as well as the generalization of the nonredundant network to 
cover the case where N / 2^. 

^ For the CPCU applications of interest to us, the cell is actually an 
r-pole, double- throw, reversing switch, where r is the numoer of output 
lines contained in, for example, a WM. It is shown in Sec. IV-B-3 that 
it is convenient for reliability enhancement purposes to byte slice the 
CPCU network so that the number of "poles" each cell contains is con- 
siderably less than r. 

* It can be shown that the same value of Nj(N) is obtained for arbitrary N. 
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A constructive proof that this network is capable of performing an 

arbitrary permutation is as follows. As previously, let the inputs and 

.outputs be labelled X^, X^, ... X^ and Y^ , Y^, ... Y^, respectively, in 

order as shown in Fig* IV-4(a). We start by building a path backwards 

from Y, through P. , and th tough whichever inpu'. cell is connected to the 

input that is supposed to be connected to Y^. (The routing of signals 

within P and P will be determined in later steps . ) A forward path is 
A B 

now formed from the other input Xy^; sharing the „ «me input cell with Xyy 

through this input cell, through P_, and through whichever output cell is 

connected to the desired destination (Yj, say) of this input line. In 

like manner, the mate to this new output is traced back through P^ to its 

associated input, and the mate of this new input is connected in a forward 

path through P to to its intended output. This path-building process is coir 

tinned, alternating use of P and P , until output Y is reached. If not 

A B A 

all input 'Output connections have yet been completed* then start with any 
unconnected output line and continue the process, going through the cycle 
as many times as necessary, until all input and output cells are set in 
cither the "bending** or ’crossing" mode. 

The entire procedures are now repeated separately for each of the 
interconnection arrays P A and P , etc. , until the entire array has been 
set up. The procedure will never be forced to stop because of the lack 
of connection, since P and P^ are each connected to every input and output 
cell— except Y for P fi and Y for P^ r and these connections are avoided by 
starting as indicated. The array for X = 3 (with some of the connections 
straightened and the outputs renumbered) is shown in Fig. IV-l(b). 

It is of interest to investigate efficient techniques for setting up 
the network cells to realise the necessary mode for a particular permuta- 
tion. Consider, for example, the network of Fig. IV-4(b^ redrawn in 
Fig. IV-4(c), and assume that we require the setting cf the cells as shown. 
Referring to Fig. IV- 3 ve sec that each cell is in the crossing mode upon 
resetting the flip-flop. The cell is set to the bending mode by the co- 
incidence of logic-one vriues on the data inputs and the P input. Clearly, 
then, by applying a logic-one value to tne P input of all ells in a given 
level of the network — in Fig. IV-4(cl a set of values, X^ - = I, 
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FIG. KM NETWORK FOR COMPLETE PERMUTATION — COMPLETE UTILISATION 








will net the cell nerving and X^— and appropriately setting the N input 
signals, the netwrk can be set up in a period proportional to the number 
of levels in the network. 

a. Byte-Sliced Commutation Networks 

In this section wo are concerned with the behavior of the CPOT net- 
works under cell fault conditions and a simple technique of accommodating 
to these faults. 

It is clear that for the case wherein the basic cells are double-pole, 
double-throw, reversing switches, each bad-output cell failure results In 
an error only on a single output of the network. Similarly, each stuck- 
function failure results in an error on a maximum of two outputs. In the 
latter case, it is sometimes possible to accommodate to this failure type 
by appropriately setting the working cells of the network. This accom- 
modation technique is discussed In detail in Sec. IV-B-3. Unfortunately, 
for she case of r-pole cells, a single failure could result in the inability 
of the CPCU network to real Ice several of the SSi-SP assignments. 

Tills state of affairs is significantly improved by byte slicing the 
CPCU network as shown in Pig. IV-5. In this case, the bytes (where each 







byte is assumed to contain b-bits) of the WM's are permuted in separate 
networks. That is, the first CPCU has, as inputs, the first byte of 
each WM, the second CPCU has, as inputs, the second byte of each WM, etc. 
The outputs of the first CPCU are ultimately directed to first byte of 
each SP, etc. It is thus seen that for this byte-sliced realisation, 
which, of course, requires the distribution of the cell memory flip-flops 
among each of the CPCV's, a cell failure disables a single byte. These 
commutation network byte failures can be accommodated for in the iden- 
tical manner proposed for other byte failures, i.e., by the use of the 
incomplete permutation— order-preserving networks described in Sec. IV-D. 

3. CPCU networks Insensitive to Cell Failures 


a. The Stuck- Function Fault 


For the moment, consider only the caso in which the network 
is fault-free or has precisely one bad switch (cell). Figure IV-fi(a) 
illustrates a straightforward solution to the single error-correction 
problem. If P^ and P^ are both full permutation networks, then a fault 
occurring in one of them (such a fault being of the stuck- function type 
that docs not diaturb lead continuity) has no effect on the operations 
of the other network. Obviously, this is a rather wasteful approach 
since all of the remaining switches In the network containing the fault 
contribute nothing toward forming the desired permutation. Instead, let 
P^ of Ftg. IV-8(a) represent a permutation (CPCU) network, and P^ be o 
network specifically designed to undo the damage caused by a fault in P^. 
Then If a fault occurs in P^, it can be repaired by P^, while a fault 
in Pj causes no trouble because P 2 is a full permuter. What is required 
of the network Pi? A single fault in P^ causes a simple interchange of 
some particular pair of leads at seme cell within the network. Tills can 
only become manifest as a spurious reversal of exactly two leads at the 
output. Of course, the possibility exists that the switch might fail in 
the correct position and no trouble would occur. In any event It would 
be sufficient that the network Pj be capable of effecting the Interchange 
of an arbitrary pair of Input leads without changing the relative assign- 
ments of the other input leads. 
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The double- tree (D-T) network of Fig. IV-6(b) eon do this job. 


Any pair of input leads can be directed to some switch on the left of 1 

the center switch. At this switch (to the left of the center switch), * 

the leads may be Interchanged. Whatever switch settings are required to j 

do this interchanging, wc copy by reflection in the (imaginary) center- I 

line to the right-hand part of the network with the exception of the f 

switch that actually effects the interchange. The corresponding switch § - 

in the right-hand part of the network is set in the opposite state. j 

The scheme is illustrated in Fig. IV-6(b) lor the particular case of j 

N = 8 In which we wish to Interchange inputs X„ and X.. Here the switch I 

A J j 

settings to the right and left of the center line are identical and the 
center switch effects the desired interchange. Similar networks exist 
for all values of N, and are obtained by pruning the corresponding tree 
networks for the next largest power of two greater than flu 

] 

When one of these double-tree networks is placed in tandem with 
a full permutation network, we are able to correct the effect of one j 

switch failure wherever it may occur in the composite total network formed 
by P 1 and P . If P is one of the CPCU type commutation networks we have 
already considered, it will be found that the input peripheral switches 
of P„ and the output peripheral switches of the D-T network match up into 

m' 

tandem pairs of individual switches* Note, for example, the pairing of 
leads at the input of the network of Fig. IV-4(b) and the similar pairing 
of output leads in Figs IV-6(b). Whenever such a pairing occurs, we may 
omit one of the two switches if we have provided for the possible failure 
of one or the other of them. We may therefore omit the entire column of 
peripheral output switches from any of the D-T networks whenever we ad- 
join them to one of the CPCU networks. The single-error correcting capa- 
bility is not affected by this pruning operation. Call the network that 
results from the removal of the output switches from the double-tree net- 
work the TDT(N) network (for truncated double-tree of N leads). Then j 

single-error correction of any CPCU(N) network can be obtained by the j 

tandem addition of one TDT(N) network. Furthermore, if the possible ef- 
fect at the output of the CPCU (N) of the multiple failure of switches is 
considered, it turns out that the addition of £ TDT(N) networks in tandem 

i 

j 

I 
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to any CPCU(N) network will suffice to correct as many as j) switch failures 
in the total network. The argument, which is much the same as the fore- 
going one, depends on the possibility o* uecomposing all such multiple 
failures into separate pairwise lead interchanges. 

To estimate the cost of error protection according to the fore- 
going scheme, we note that the TDT(N) networks contain approximately 
(3/2)N switches. To correct £ errors then takes about 3/3 (pN) switches, 
thus, if N is very large, we can correct a ’’few" errors at a cost that 
is small compared with the total number of switches in the network 
(paN log 2 ft). On the other hand, correction of multiple errors with 
TDT(N) networks does not furnish a recipe for creating arbitrarily reli- 
able networks (in the Shannon sense) while still meeting the asymptotic 
cell count, l.e.,«*N log^ N» We have not yet discovered how to resolve 
this problem although a solution seems possible. If anything can be con- 
cluded from results obtained for small values of N, it seems that single 
fault correction should be obtainable at a cost of <log 2 N) 4f extra switches 
in excess of the CFCU(N) count. In particular we can exhibit specific 
networks that correct one fault and have switch counts as indicated in 
Table IV-2. In each case the number of switches shown in Table IV-2 is 

Table IV-2 

NUMBER OP CELLS IN SINGLE STUCK- FUNCTION 
CORRECTING CPCU NETWORKS 


Leads 


Switches in 
Redundant 
Network 


CPCU(N) 


The ^ x > is the next integer larger than x, 
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exactly (leg 3 ff) larger than the corresponding value of CPCU(N). In the 
cases S * 3, 3 and 4, It la reasonably certain that these realizations 
are the minimal ones that exhibit the single-fault correcting property. 

Ah example of a single-fault correcting network for N * 4 is illustrated 
in Pig. IV-d(c). 

A considerable number of network forms and Iterative combining 
rules were studied during attempts to establish the minimum cost of fault 
correction. Several time-shared computer programs were written that allow 
heuristic tinkering with networks , l.e«, the prog*i...i BAD-ONE that verifies 
whether a proposed network la indeed a full permuter If one cell has a 
stuck- function fault. As a result of all this experimentation! one feels 
Impelled to make the following conjecture: 

The cost of protection against (correction of) 

£ faults In a CPCU('!) network is no more than 
the difference In cost between CPCU(n) network 
and a CPCU(N + p) network. That is, the cost 
of correcting each additional fault, say fault 1, 
is smaller than (log 2 (N + i)). 

If this conjecture Is Indeed true, then there would exist per- 
mutation networks of arbitrarily high degree of reliability whose cell- 
count would not exceed K*N^(N), where K is a constant related to the 
probability of a switch failure. 

b. An Alternative dingle Stuck-Functlon Correcting Construction 

A different and slightly more economical [than the TDT(N) device] 
method of providing single-fault protection in CPCU(N) networks stems from 
the observation that the construction of Pig. IV-4(b) can tolerate one 
cell failure in any peripheral cell if an extra cell, serving outputs Y^ 
and Y^ column, Is retained rather than deleted. The reason for deleting 
this cell in the first place was that exactly one peripheral cell is un- 
necessary in the nonredundant case. Since all of the peripheral switches 
are exactly equivalent in function, it makes no difference which one we 
delete. Hence, by retaining all of them we can ignore exactly one failure 
in any of them. If each of the "internal" permutation networks that im- 
plement the construction of Pig. IV-4(b) are augmented in the same way 



by replacing the deleted per 
then we obtain a network « hill 

failure anywhere* An example of-;- such o' network, namely 8^(4) , #hOWh!SSi 
in Fig. 1V-7. We notice first that the number of cells . ft » 8 so that •■ 
the network lg not minimal [gee Fig, IV-0(c)], However , ou counting the^ 
redundant celle required when H »-3- T we ^Ind-thatUitte —ex tra cell! are ■ 

■ ‘ - . - ' • y , - 

exactly N - 1 in number. Thle meane that in the apecial cage N m 2 the 
networks 8^(1*) are more economical than those produced by annexing a 
TDT(N) network to a CPCU(n) network. Recall that this required about 
3N/2 extra switches. 



FIG. IV-7 NON “MINIMAL REDUNDANT 4-PERMUTER, SINGLE STUCK- FUNCTION 
CORRECTING 


c. Correction of Bad-Output Fault Typea 

Considering the effect of a single fault of this type on an 
otherwise full permutation network, it becomes apparent that exactly one 
output lead will receive an incorrect signal in response to the applied 
Inputs. The situation can be remedied by adding one extra lead to the 
N-permutatlon network if we can also be certain that the input signals 
are applied to, and the output signals derived from, the correct subset 
of N leads. One way of accomplishing this is illustrated in Fig. IV-8,* 
The figur" is drawn for the case N = 4, but the method is perfectly 


* Clearly a bad-output failure on a cell immediately preceding a network 
can never be accommodated for. In this case the appropriate byte is 
disabled. 
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FIG. JV-8 NETWORK FOR CORRECTING BAD-OUTPUT FAULTS 


general. To permute N leads, we employ a nonredundant N + 1 permuter 
flanked on input and output with a ladd er network having N switches. If 
the N + 1 permuter has a bad cell lead, this will show up as a failure 
of one of the signals on leads A, B, C, D or E of the internal N + 1 per- 
muter to arrive at its specified output. By setting the switches of the 
ladder network in the obvious manner, the fault can be corrected, as 
illustrated in Fig. IV-8 for the case wherein input C does not arrive 
correctly at internal output 4. 

The cost of correcting one failure by the method of Fig, IV-8 
is 2N switches for the ladder networks plus \log (N + l)) extra switches 
[assuming the CPCU(N) construction of Fig. IV-4(b)] to implement the 
N + 1 permuter rather than the N permuter. This cost, ^2N + log^ N, is 
asymptotically negligible compared with the cost of a CPCU(N) network 
for large N. It is obvious that the foregoing construction can be extended 
to correct multiple faults cf the bad-output type. 
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C. Commutation Networks for Complete Permutation — Incomplete Utilization 


We are assuming here that the commutation network is to serve N in- 
puts and N outputs [similar to the function of the CPCU(N) network], but 
in this case it is only necessary to provide simultaneous connections 
between m(m < N) inputs and outputs. Such a commutation function, re- 
ferred to as embodying complete permutation — incomplete utilization, 
is denoted as CPIU(N,m) 0 It is desired to specify a network that is more 
economical in terms of cell-count and/or is easie; to set up than a CPCU(N) 
network which, of course, also achieves the CPIU(N,m) function. 

It is easy to see that a CPIU(N.ik) network must contain enough tuo- 
N 2 

state cells to specify (* ) (in! ) possible permutations; thus, N (N,m)> the 

m J 

nvr ‘_*er of cells in the network, is given by 

N 2 (N,m) - log 2 (m*)J 


or 


N 2 (N,m) Si 2 log 2 (^) + N' 1 (m ) 


The above formulas suggest that the CPIU(N,m) function could b-i realized, 
as depicted in Fig. IV-9. by a network composed of a CPCU(m) block (m- 
permuter) sandwiched between two combination networks. 

It is assumed that a combination network serving m inputs and N out- 
puts, denoted as a COM(m,N) network, is to have the capability of connecting 
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FIG. IV-9 SCHEMATIC REPRESENTATION OF DECOMPOSITION OF A COMPLETE 
PERMUTATION INCOMPLETE UTILIZATION NETWORK 


58 




the set of m inputs onto any specified set of m output leads, without 
regard to the order of these signals on the outputs. A similar definition 
applies to a OOM(n,m) network where the number of inputs is assumed to 
exceed the number of outputs. 

Ttie number of cells, N (N,m), required for an m,X (or an X,m) com- 

c 

bi nation network is given by: 


— 

* (».m) ^ log ^ log - ,7< r y * v, 

c 4m 4 m. (N - m / • 


Asymptotically, from Stirling’s formula, we find that a network composed 
01 N - 1 cells should be sufficient to perform the combination function. 

We have not found combination networks, composed of the 2-input basic 
cell which approach this X - 1 cell bound, as closely as the CPCtf(X) net- 
works approach the (log^ XJ) bound. However, it is not difficult to specify 
a COM(N,m) network that requires only N two-state cells, but wherein each 
cell contains m + 1 inputs. 


Consider the two-state cell depicted in Fig. IV- 10, with m horizontal 

inputs (m = 3 for the case shown), 1^, I^, 1^, m horizontal outputs 

0 , 0 , 0 , one vertical input, X., and one dummy (unused) vertical 

14m l 


'2 

I -j 


o, 

°2 

°3 


X- 


io) 

CROSS MODE 



FIG. iV-10 BASIC CELL WITH AUGMENTED SET OF INPUTS 


As indicated in Sec. IV-B-1, if a network is commutating parallel data 
channels, each call input is in reality a bundle of many wires. 
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output. In the ‘cross" mode, the three horizontal inputs are transferred 

unaltered to the three horizontal outputs. In the bend mode we effect 

the transformation X. -* 0 , I - # 0 , . . . , I -* 0 . An m,N combina- 

1 mm m- 1 « l 

tion network is easily synthesized as a cascade, containing X of these 
augmented input cells, as shown in Fig. IV-11 for the case m - 3, N* = 6. 

The appropriate cell modes are shown for the case where it is desired to 
connect the inputs X.,, X^, and X^ to the three outputs, , Y t> , Y^, which 
are the horizontal outputs of the lost cell in the cascade. It is noted 
that the first cell in the cascade actually requires no horizontal inputs, 
the second cell only 1 horizontal input, ..., the mth cell only m - 1 hori- 
zontal inputs. However, if we consider the approximate cost of a cell to 
be proportional to the number of terminals served, then the cost of the 
combinational realization of Fig. IV-11 is of the order of raX. We can 
find realizations that require a number of cells in excels of X, yet 
exhibit a cost measure significantly less than the network previously 
described. 

We will recursively synthesize a COM(m,X; network that is composed 
of the basic two-input cell , using a procedure suggestive of the one 
describaa in Sec. IV-B-1. Consider the COM(m,X) network represented by 
Fig. IV- 12, where it is assumed that 2|m and 2jx. The subnetworks, 
and Q^, are themselves combination networks, each with half of the number 
of inputs of the total network. This arrangement requires a number, 

N (m,N), of cells which satisfies: 

N 2 (m,X) = 2N 2 (m/2, X/2) + | - 1 . 


Solution of this recursion for X = 2* , m = 2* *, w using X f> (l,2) - 1*) 
yields 


«> 





> - 




2 r ’ 1 . (v 


2)2 


r-1 


* It is clear that a single two-input cell, where one of the inputs is 
not used, is a 00X1(1,2') network. 
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m IV 12 RECURSIVE APPROACH TO m-N COMBINATION NETWORK 


Of 


.N^ru 1 " 1 , a 1 *' 


iog 2 (n> - 1 ♦ 1 


A similar expression can be derived for the case wherein m, H >ik not 
powers of a. 

A constructive proof that this network in capable oi developing a 
path fro tii the m inputs to utt arbitrarily selected set >*> t-t outputs is us 
follows. Let the inputs and outputs be labeled X^ , X SJ , , ,,, X^ , ami 
\C » Y,,i . Y v us indicated in fig. 1V-1U* He will start bv indicating 
the appropriate modes for the (N - 1 ( output^ cells serving out- 
puts Y„» . Y.. , so that for un arbitrary select ion of n outputs, 

exactly mj2 of these outputs are connected to amt mfa to If an 

output cell serves two selected outputs, it can be arbitrarily set to 
either mode. The output cells which serve the remaining set of selected 
outputs arc then set to the appropriate mode so that the first of these 
outputs (including possibly Y^ is connected to , the second connected 
to Q 2 , etc. 

The entire procedure is now repeated for each of the networks Uj 
and Q fJ> etc*, until the entire network has been set up, The network fur 





FIG IV 13 4 0 COMBINATION NETWORK 


m I, N • m, 1m shown in K i g . IV-13, whvrv the cell functions arc indi- 
cated for the solve ted output set Y_ , Y,, Y_, Y-. 

2 19 9 

It Is interest lnt{ to investigate techniques for tncorporst ing re- 
dundancy Into these combination networks such that they can continue to 
eff*ct a given assignment In the presence of cell failures. Considering 
th« * tuck- funct ion failure, a simple technique can be specified, for 

*»»: I • failures of thin type, similar to the method described in ,Soc-B-3(b). 
In terring to Ft g. IV-12, we note that an output cell was not required to 
serve «nit puts Y f and V , and a corresponding cell was omitted from each 
internal subnetwork. ll we insert this cell In each case, und In addi- 
tion duplicate the cells serving the input leads, we will have added 
n - I cells and produced a network that is tolerant to single stuck* 
fun* t i*»n failures. In Fig. 1V-1I we display such a redundant I - H com- 
bination network and indicate the appropriate cell modes, so us to connect 

in. inputs t** Y , Y when* in cell 3 la ussumed stuck in the cross mode. 

1 *! 

lor the case of bad-output failures, a technique similar to that 
described In Sec. IV-B-Sfc' can be applied. The technique for u single 
failur* in the combination network case would require u COM(m ♦ 1, N ♦ 1) 
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FIG. IV-14 REDUNDANT 4-8 COMBINATION NETWORK FOR CORRECTION OF SINGLE 
STUCK -FUNCTION FAILURES 


network flunked by ladder containing m cells on the input and a ladder 
containing N cells on the output. The technique generalizes easily to 
accommodate to multiple failures. 

D. Commutation Networks for Incomplete Permutation — Xonorder Preserving 

It may be recalled that incomplete permutation — nonorder-preserving 
networks are required to establish connection paths between a set of in- 
puts and outputs, where only a subset of the inputs and outputs are re- 
quired for a particular task, and the spatial ordering of the signals at 
the input does not have to be retained at the output. One application 
of these networks, which has been described, is concerned with the trans- 
fer of data between registers where a redundant set of registers is 
specified. * 

We note, somewhat trivially, that a COM(m,N) network would function 
as an incomplete permutation--nonorder-preserving network serving a non- 
redundant set of m inputs and a set of N outputs of which N-m ore redun- 
dant. In this section we will be concerned with networks for the incom- 
plete permutation — nonorder-preserving function, IPNOP(r,m), wherein there 
are r inputs and r outputs, and it is necessary to connect m input-output 


* Note that this application is quite different from the case wherein 
we are concerned with the transfer of information between registers 
that have redundant bits. In the latter case w ? e would require an in- 
complete per mu tat ion-- order- preserving network. 
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pnirs together without regard for spatial order. For example, if r = 6, 
m = 3 and we distinguish inputs X^, X^, X 4 and outputs Y^, Y^, Yg, then 
the network is functioning properly if it would establish one of the fol- 
lowing assignment sets, [x^ - Y^, Xg — Y 2> X^ - Yg], orCXj -• Yg, X g -* Y 2 , 
X, - Yj], etc. 

It can be shown that a lower bound on the number of two-state cells 

/ r r 

required for a IPNOP(r,m) network is between log 0 ( ), and 2 logo ( ). 

* m “ m 

In Section IV-E, which is c< icerned with the order preserving case, we 
will describe an incomplete permutation — order-preserving network com- 
posed of 2r two-state cells that can, of course, function is a IPNOP net- 
work. However, each cell in the network must serve up to m + 1 inputs, 
providing an overall network cost that approaches 2r(m + l). We will now 
describe an IPNOP(r,m) network that requires more than 2r cells, yet ex- 
hibits a cost measure significantly less than 2r(m + l). 

Consider the network show^n in Fig. IV-15, which we will demonstrate 
yields recursively an IPNOP(r,m) network, where R^ and are each 
IPN0P(r/2, m/2) networks. This arrangement yields a number of two-state 
cells, N„(r,m), which satisfy 

(r ,m) = 2N^ (r/2 , m/2) + r - 2 

Solution of this recursion for 4 = 2 k , m = 2^ * and, using N^(2,l) = 1,* 
gives 

N 3 (2 k , 2 k_1 ) = 2 k (k - 1) - 2 k_1 + 2 


The proof of tills bound is deferred until the final report under this 
contract. It appears for this case that the lower value is the tighter 
bound. 

+ It is clear that a single two-input, two-output cell is an IPNGP(2,l) 
network. It is also a IP0P(2,l) network, a property which will be ex- 
ploited in the succeeding section. 
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FIG. IV- 15 RECURSIVE APPROACH TO INCOMPLETE PERMUTATION — 

NONORDER PRESERVING NETWORK 

or 

3 

N 3 (r, r/2) = r log 2 r - 2 r + 2 

k** 1 

Although we considered only the case m r/2 ~ 2 , the recursive tech- 

nique is quite general and will yield a network corresponding to arbitrary 
parameters, r, m. 

A constructive proof that this network is capable of finding a mate 
for each Input and output contained in an arbitrary set of m inputs and 
m outputs is quite similar to the proof provided in Sec. IV-C for the 
combination network. For the network of Fig. IV-15 it is apparent that 
the peripheral cells immediately serving inputs X 9 , and the 

cells serving outputs Y^ , «*.# Y^ can be set so that exactly half of 
m distinguished inputs are directed to and half are directed to R^; 
an identical requirement is satisfied for the rn distinguished output 
leads. The entire procedure is repeated for each of tile networks R^ 
and R 9 , etc., until the entire network has been set up. 
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Procedures, similar to those described in Secs, IV-B-3(b) and IV-B-3(c), 
can be applied to the IPNOP(r,m) network, so that the network is tolerant 
to stuck-function and bad-output type faults. For the stuck- function case 
we note, by referring to Fig, IV-15, that a single cell is missing on the 
input and output portions of each subnetwork. The insertion of this cell 
at each level in the network will yield a network tolerant to single stuck- 
function cell failures, 

E. Commutation Networks for Incomplete Permutation — Order Preserving 

It may be recalled that the memory modules, arithmetic logic units, 
and simple processor and control units can be realized as a cascade of 
identical byte slices. These modules can continue to function, upon the 
occurrence of failures in slices, if several spare slices are provided, 
and if a commutation network is provided to route the signals between 
operating slices. The function of such a commutation network, denoted 
as an incomplete permutation — order-preserving network, IPOP(r,m), is to 
set up connecting paths between an arbitrary set of m inputs and an ar- 
bitrary set of m outputs, both sets of which are subsets of the r inputs 
and r outputs, r > m, so that the signal order is the same at both input 
and output. 

Similar to the nonorder-preserving case, it can be shown that a 

lower bound on the number of two-state cells required is log 0 (— ) and 

2 m 

2 log (-), although for the order-preserving case, the upper value ap- 
2 m 

pears to be tighter. It is possible to realize the IPOP(r,m) function 
in a network composed of 2r two-state cells, where each cell contains 
m + 1 inputs. It is seen that this network approximately satisfies the 
2 log 2 (~) bound for the case m =* r/2 since 

lirn log^ ( [|] ) = 2r - 2 


The basic cell is the type shown in Fig, IV- 10, and the network is dis- 
played in Fig. IV-16 for the case r =6, m = 3; the modes of the cells 
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are such as to realize order-preserving connections between inputs X^ 9 
X , X and outputs Y , Y & , Y . Even though the number of horizontal 
inputs served by the first m - 1 cells in the cascade and the number of 
horizontal outputs served by the last m - 1 cells in the cascade can both 
be reduced, the cost of this network is of the order of 2mr, a consider- 
able cost. Similar to the situation involving the other commutation 
functions, the cost of the realization is signif icantly reduced if the 
tvo-input cell is used as the basic primitive block. 

A network composed of two-input cells, which realizes the IPOP(r,m) 
function, is identical to the network (Fig. IV-15) for the nonorder- 
preserving case. We have redrawn the network of Fig. IV-15 as Fig. IV- 17 
for the case r = 8, m = 4. It may be recalled that for the nonorder- 
preserving case the function of the (r - l) output cells and the r - 1 in- 
put cells was to connect the distinguished m input and m output lead* to 
the two subnetworks r an a R^ , such that exactly m/2 distinguished inputs 

and m/2 outputs were connected to each of the networks R and R . For 

X z 

the nouorder- preserving ease the input and output cells could be set in- 
dependently; such is not the case if. the network is to be used as an 
order preserving network. 



FIG. 1V-17 RECURSIVE APPROACH TO INCOMPLETE PERMUTATION — 
ORDER PRESERVING NETWORK 
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The following procedure, for the order-preserving case will indicate 

the proper mode of each cell of the network of Fig. IV- 17, for a given 

set of m inputs and outputs; and, hence, will prove that the network can 

function as an order-preserving network. Let the distinguished set of 

ra inputs be X* , Xj , ..., X i , where i > i A if a > 8, and the m outputs 
i £ m zt c 

be Y j ^ , Y j 2 > Y j m » where j > j 2 if q > 3. Consider each of the 

m inputs as residing in one of two disjoint groups. Group contains 
those inputs that do not share an input cell with another distinguished 
input, and group contains those inputs that do share an input cell. 

We will similarly define groups and for the distinguished outputs. 
The goal is to assign X^ and Yj^ to che same subnetwork (S^ or S^), 

and Yj^ to the same subnetwork, etc. , and the procedure is as follows. 

Assign X ^ and to network S^, by appropriately setting the per- 
tinent input and output cells (except for the case where and/or 

Y j 1 = in which case the assignment to is automatic "L Then assign 
and Yj 0 to network $2^; and are in ^ rou P B j * a nd/or Yj^ 

and Yj 0 are in group B^, the assignment to is automatic. Next, assign 
•^3 a nd Yj^ to S , etc. , until all of the m distinguished inputs and out- 
puts have been set. This procedure is tnen applied to set the pertinent 
output and input cells of the networks S x and S , etc. It is clear that 
this assignment procedure always be carried out. In Fig. IV-17 we 
show the setting of the input and output cells for the case = X 2 , 

Xi 2 = V X i 3 = V X i4 = X ; and Y Jl = V Y J 2 = V Y J 3 = V Y J4 = V 


The techniques for providing failure tolerant IPOP networks are not 
discussed in this section since they are Quite similar to the techniques 
described in Sec^. IV-B-3 (b^ and IV-B-3(c). We note that a cell failure 
in the IPOP network (two-input cell type) can disable no more than two 
byte slices each for the input and output. Since it is assumed that re- 
dundant slices are provided, it is possible that a nonredundant network 
would be used, and when cell failures are detected, the slices that could 
not be served by the network would be discarded. 
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F. Commutation Networks for "Short ing" 


In Sec, IV-E we described networks that, for a redundant byte-sliced 
network, can serve to route external data between the operating slices 
of distinct networks (e,g. , between an SP and ALU). It was noted 1 that 
internal data (e.g. , control and carry information) must be routed between 
the stages of the byte-sliced cascade. If a state (or slice) has failed, 
then the internal data intended for that stage, which clearly comes from 
its immediate predecessor or successor, must be shorted around that failed 
stage. If this shorting process is not accomplished reliably, then the 
entire network will be disabled. 

The shorting function is quite naturally achieved with the two- input 
basic cell, as illustrated in Fig. IV- 18. For simplicity, only a signal 
flow to the right has been indicated although it is clear that the network 
could be modified to handle bi-directional flow. We have shown the appro- 
priate cell modes so that byte slice 2 and byte slices 5 and 6 are shorted 
out. We note that the network could recover from a single component fail- 
ure within a cell, which results in either the stuck-function failure or 
the bad-output failure. However, a more severe cell failure which results 
in, for example, a permanent logical zero signal on both outputs of a cell 
would clearly disable the network, i.e., interrupt the signal flow. 

Such a failure, which could only result from two component failures 
within a cell, could be accommodated for by the redundant shorting net- 
work ol Fig. IV-19. Also indicated are the appropriate modes for cells S^, 
S*, S ^ , S' such that byte-slice 1 is shorted out, i.e., the output from 
slice 0 is directed to the input terminal of slice 2. We have also shown 

the appropriate modes for cells S 0 , s' S such that the network con- 

3 3 4 

tinues to function although both outputs of s' are faulty. In this case 
bvte-slice 3 cannot be used, but the signal flow' is not interrupted. 
Similarly we have shown how the network accommodates to a double-output 
failure in S_ in which case slice 5 is bypa ;sed. This technique can be 
clearly extended to handle failures of greater multiplicity. 
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FIG. IV- 19 REDUNDANT SHORTING NETWORK 






C* . Sumrnar y 

In this chapter \ve have studied in detail the logical design of net- 
works that could perform the various data switching or commutation required 
in a multiprocessor organization where the various modules are repairable. 
U is assumed that the memory, arithmetic logic, and possibly the simple 
processor and control modules are realized in a byte-sliced manner — a 
realization that has been demonstrated to be practical. It is felt that 
the designs we have presented, based upon the primitive two-input, two- 
output reversing cell represent adequate engineering solutions to all of 
the commutation problems posed, although some theoretical minimization 
problems still remain. These problems relate to minimum cost designs 
for the complete- and incomplete-permutation functions considering both 
the nonredundant realizations and the realizations that are tolerant to 
cell failures. 


7-1 


V ULTRARELIABLE PROGRAMMING 


In this section the problem of constructing ultrareliable computer 
programs is treated. The term "ul trareliable" in this context refers to 
a program that not only operates completely flawlessly in normal circum- 
stances, but one which is insensitive to faults that are introduced through 
the input data stream or certain hardware failures. The approach used here 
is to classify the common reasons for faulty programs, and discuss methods 
for the prevention of these faults, the detection of failures arising 
from such faults, and the recovery within the computer system from detected 
failures. Although the principal interest of this section is computer 
software, because of the intimate relation of computer software to computer 
hardware, in many cases the appropriate means for attaining reliable pro- 
grams will be hardware oriented or will be a combination of hardware and 
software techniques. 

In Sec. A the classification of common software faults is presented. 
Each class is treated separately in Secs. B-D, and Sec. E contains a short 
summary and a list of several unresolved problems that are identified in 
the course of this discussion. 

A . Classification of Program Faults 

For purposes of exposition, we define a fault to be a program charac- 
teristic that can cause a program to execute improperly under some set 
of conditions that may depend on program state, input data, or timing. 

Ail instance in which a program executes improperly is said to be a f ai lui*e . 
In this section, common program faults are grouped into one of three cate- 
gories according to qualities that determine methods for eliminating the 
faults. The categories can be briefly described as faults due to problems 
of data analysis, program checkout, or the execution of a program in a 
context that is outside of the scope of its validation and definition. 
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These faults can be attributed to human mistakes in the composition of 
the program or to hardware faults in, for example, a back-up memory 
where a program is stored. More specifically the categories are: 

Type I. Program algorithm is essentially correct, but 
program produces inaccurate results or fails 
to terminate because of problems in numerical 
analysis. 

Type II, Program contains bugs , i.e., has faults such 
that it fails to perform processes according 
to specifications. 

Type III. Program is completely correct for its scope 
of activities, but fails when operating out- 
side its scope. 

Because the statements above are stated rather broadly, the cate- 
gorization immediately allows us to make some statements about a general 
methodology for achieving reliable programs. 

Type I faults are principally due to failure of an algorithm to 
account for cumulative effects of round-off and truncation errors, or 
operates on a range data for which the algorithm is unstable/' The 
methodology to be used for this class of faults is oriented to problems 
in the representation and manipulation of numerical data in a computer. 
Rather than attempt to address this section to the entire field of 
numerical analysis, we shall merely call attention to the need for such 
analysis and devote the greatest portion of the discussion to the design 
of hardware that will alleviate many problems of numerical analysis. 

The methodology to be used to eliminate Type II faults, the program- 
ming bugs, is slanted to the use of redundancy in program specification 
to permit software support programs to aid the creation and check-out of 
reliable programs. Program bugs can arise from many sources and all are 


At initial observation numerical appears to be strictly a software 
problem which is solved once, i.e., when t’\e problem is written, but 
within the graceful degradation concept we are proposing, it is imper- 
ative to detect instabilities in algorithms caused by an insufficient 
quantity of equipment remaining available for program execution. 
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susceptible to some checks. Transcription errors can be caught by redun- 
dancy in the language, logical errors by the use of aids like decision 
tables, and blunders by consistency and completeness checks that are 
programmed to be independent of the algor ithm that they check. 

Type III faults are intended as a catch-al l category. Failures that 
arise either from undetected Type I or II faults or from minor transient 
hardware failures generally appear in programs that are otherwise error- 
free. It is common practice to validate programs by checking their behav- 
ior with test data that lies in their scope of validation, i.e., for the 
if conditions of hypothetical cases. Few programs are written to work 
properly only if the data and the state of the program are confined to 
the scope of validation. To guard against Type III faults, we use the 
methodology of if and only if programming. This is the practice of using 
extensive checks, both software and hardware, to validate all input data 
and, when possible internally generated data in order to guarantee that 
they are in the range of definition and remain so during the course of 
a computation. 

B. Faults Arising from Numerical Analysis 

1 . The Need for Analysis 

Most of the problems of numerical analysis that give rise to computa- 
tion failures are due to the finiteness of the representation of numbers 
in a computer. If, in practice, numerical representations can be made 
arbitrarily long, representation errors can be made arbitrarily small. 
Nevertheless, for practical reasons, numerical data are fitted into fixed- 
length fields that are deemed to be sufficiently long to give accurate 
results in most cases. The length, which varies from machine to machine, 
is typically between 24 and 56 bits long for floating-point mantissas. 

A calculation, i.e., the implementation and execution of an algorithm on 
a specific machine, must be subjected to thorough numerical analysis to 
guarantee that it will produce results of sufficient accuracy. 
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To borrow an example from Ccdy, 19 let us examine the computation of 
the mean of two numbers and use the formula 

M = (x. + x 0 )/2 

1 2 

i 

If the representation of and is a floating point, and the base is 
other than 2, then in a binary computer it is possible to compute a mean 
| M that is less than either x^ or x 9 . For example, consider the computa- 

tion using radix 10 arithmetic with two digit precision. Let x^ - 51, 

x = 52, and note that 51*10° + 52*10° = 103*10°, which, to two places 

d 1 0 
accuracy, is 10*10 . Division by two yields 50*10 , which is less than 

either operand. 

There are two serious effects of this type of fault* The most 
obvious is in the introduction of a small error in the least significant 
digit that is slightly greater than the apparent accuracy of the computa- 
I tion. A more subtle effect of the error is to place the result outside 

the theoretical range of possible answers. The latter effect could lead 
to instability of a computation. 

A second example of the pitfalls of computation is given by Neely, 20 
who illustrates several alternate computations for the mean, standard 
deviation, and correlation coefficient calculation on typical statistical 
1 data. Although the several alternatives are algebraically equivalent, 

the computations give widely disparate results. In his example, the 
^ most direct computations generate answers that are among the least 

accurate, and the simple expedient of carrying out the calculation in 
double precision results in answers that are among the most accurate. 

The point of these examples is to demonstrate the need for numerical 
analysis to aid in the development and analysis of computations for 
particular computer systems. It is vital that the analysis be done in 
the context of the computer on which the algorithm is to be executed, 
of course, because it is precisely the idiosyncratic behavior of computer 
arithmetic units that requires the detailed numerical analysis. The 
grosser characteristics of numerical processing are inherently consistent 
from machine to machine. 
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Because the numerical analysis is so broad a subject, we cannot be 
more explicit than to issue a caveat to the programmer to take into account 
the problems of numerical analysis. However, there are other general guide- 
lines that fall in the area of the computer design which we can explore 
here. In particular, the hardware can be designed so as to minimize the 
errors of numerical representations, and specific steps can be taken for 
the detection of computational errors, and for the programmatic recovery 
from errors when they are detected. We pursue these questions separately 
in the following subsections. 

2 . Design of Floating-Point Hardware to Aid Numerical Analysis 

Two differing viewpoints have come to be reflected in the design of 
floating-point hardware. The first focuses on obtaining- the greatest 
possible precision of an operation and usually depends on normalized 
arithmetic. The second viewpoint, which is somewhat opposed to the first, 
is concerned primarily with obtaining results that are known to be sig- 
nificant at the possible expense of precision. The latter is implemented 
primarily in unnormalized arithmetic. Both viewpoints are compatible 
with reliable computing, because the best estimate of the true value of 
an answer can be obtained through the use of normalized operations, 
whereas unnormalized operations can be used to give an estimate of the 
range of possible values that may contain the true answer. In this 
section we discuss the design of floating-point hardware for both normalized 
and unnormalized operations. 

We first consider normalized floating-point operations. In the pre- 
vious section, the calculation of a mean yields an erroneous result because 
significant data is shifted off the right hand portion of an intermediate 
result and replaced by significant 0's. Even if the operation is normalized 
floating-point addition, it is possible to lose significance as shown in 
the example when the floating-point radix is greater than 2. When the radix 
is not binary, then the representation of a normalized quantity in a binary 
machine may have leading 0 bits (lO’lO 1 in the example) and the leading 
O's are obtained at the expense of bits lost from the least significant 
portion of an operand. Hence, in a binary computer, normalized floating- 
point arithmetic in a radix other than 2 contributes to the inaccuracy of 


79 


w 


computation, The number of bits of lost precision is . approximately one 

half the base two logarithm of the floating-point radix. For radix 2, 

this loss is 1/2 bit, which is equal to the intrinsic loss of precision 

in the binary representation of real quantities. For larger radices, 

the loss of significance becomes greater and can become quite noticeable 

because it introduces a biased error in the representation. Single 

precision on System/360 computers, for example, is radix 16 with 24 bit 

mantissas, yielding a significance loss of two bits in 24 (a precision 

6 n 

of one part in 6*10 ) . This is very low for general purpose computation 
and must be used with caution. 

To obtain the most reliable results, it is clear that radix 2 yields 
the greatest precision for a fixed mantissa length in a binary computer. 
There is a trade-off involved in using radix 2 arithmetic because the 
increased precision of mantissas comes at the cost of increased length 
of exponent. Larger radices appear to be attractive because for each 
bit of exponent that ■» s saved, only 1/2 bit of significance of mantissa 
is lost. However, because larger radices introduce a biased error, and , 
because the apparent economy of representation, using large radices, is 
only one or two bits for radices that are reasonable to implement, it is 
recommended that radix 2 arithmetic be used for normalized f loating-point 
operations, especially where reliability is an important factor. 

There are several details in the manipulation of numerical quantities 
that require examination. It is straightforward to construct a repertoire 
of normalized arithmetic operations that consistently yield the highest 
possible precision. Nevertheless, the most important details are given 
here because so few computers have incorporated all of these details into 
their hardware. In the material that follows it will be assumed that the 
floating-point operations are radix 2. 

Floating-point addition and subtraction operations introduce signifi- 
cance loss when an operand is shifted during exponent adjustment. If we 
assume that operands are normalized, then the smaller of a pair of operands 
must be right- shifted until its exponent is equal to that of the larger 
operand. Hence, significant digits are shifted off the right of an operand 
prior to forming a sum. 
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A single guard bit at the right-hand end of an adder can be used to 
ensure that the smaller operand is rounded rather than truncated prior 
to addition. Both truncation and rounding introduce errors in significance, 
but rounding errors are preferable because they tend to be unbiased. 

Multiplication in floating-point representation can sometimes yield 
unnormalized products, even when both operands are normalized. Post- 
normalization of the product involves a left shift of no more than one 
bit, but that bit should be significant to ensure full precision. Hence, 
the multiplication hardware must make provision for saving a guard bit 
from the partial product in case a postnormalization left shift is 
necessary. Rounding of the final product should occur after postnormaliza- 
tion so that there must be provision for a second guard digit for rounding 
the product in case postnormalization is necessary. 

Accumulators in computers can be constructed with several guard 
digits so that several intermediate operations can use extended precision 
operands, then round the numbers to single precision prior to restoring 
in main memory. In actual practice, many algorithms make use of double- 
length accumulators for intermediate calculations, following the philosophy 
that is suggested here. There is a trade-off to be considered here because 
the number of guard digits that should be kept in an accumulator depends 
on the relative magnitude of the operands and the number of operations to 
be performed prior to restoration of data in memory. It may be possible, 
for example, to obtain satisfactory precision by extending the accumulator 
by four to six bits, and, thereby, save a second full-length accumulator 
for other purposes. When the nature of the calculations are known, their 
characteristics should be taken into account to determine how extended 
precision operations can best be implemented. 

When single precision is inadequate for a computation, all arithmetic 
operations can be made sufficiently accurate by using multiple precision 
arithmetic and extended length numerical representations. Some multiple 
precision operations might be included in the computer instruction reper- 
toire, but provision should be made for performing extended precision 
operations of arbitrary length by using appropriate software subroutines, 

,-.;P 
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To simplify the multiple precision software, it is essential that the 
mantissa overflow bit be program-accessible so that overflow bits can 
be treated as carries from word to word in the extended precision 
representation of a number. It is also necessary that overflow be 
signalled after the completion of an operation and that the single 
precision result of an operation that produces an overflow condition 
be significant except for the information that is held in the overflow 
bit. In practice, some computers treat overflow as an error condition 
and place meaningless information in the mantissa of the result. This 
places an undue burden on the task of programming the multiple precision 
operations. Overflow of exponent should be treated in the same manner 
as overflow of mantissa. 

The steps above are suggestions for obtaining the greatest possible 
precision from floating-point hardware. An alternative viewpoint is to 
obtain results that are guaranteed to be fully significant. For this 
purpose the unnormalized mode of floating-point arithmetic has been 
recommended and is described in detail elsewhere. 21 Unnormalized 
arithmetic achieves its goal by controlling the postnormalization of 
results to eliminate the introduction of insignificant data. For addi- 
tion and subtraction, postnormalization is eliminated completely except 
when overflow occurs. Multiplication and division use more complex 
formula to determine the amount of postnormalization. The discussion 
on the dangers cf truncation as opposed to rounding and the need for 
guard digits applies to unnormalized operations as well as to normalized 
operations. 

3. Detection of Failures Arising from Numerical Analysis 

Analysis problems usually lead to one of two types of failures, 
loss of numerical significance or algorithm instability. We deal with 
each of these separately. 

To detect loss of significance, several competitive methods might 
be implemented in an ultrareliable system. We have already discussed 
the use of unnormalized arithmetic briefly, note that this is attrac- 

tive for guaranteeing the significance of the result. Calculations can 
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be performed in both normalized and unnormalized modes so that the best 
estimate of the true result can be obtained from the normalized answer 
while the unnormalized answer can be used to indicate the precision of 
the calculation. 

Another scheme is to perform a calculation in two modes in order 
to define the endpoints of the interval of numbers that contains the 
true answer. One endpoint is calculated by using basic operations that 
always truncate their results and by using algorithms that provide a 
lower bound for the true result. The second mode is identical to the 
first except that results are always rounded upwards and the algorithm 
is programmed to provide an upper bound for the true result. Inter- 
mediate operands and final results are intervals on the real instead 
of simple real numbers, hence the name interval arithmetic . A full 
discussion appears in Sec. V-B-4. 

Check algorithms can be used to reveal the accuracy of a result. 
Linear problems, for example, usually yield a set of residuals that should 
sum to zero. The actual sum of the residuals provides an estimate on 
the significance of the result. In many other cases, there is sufficient 
data available at the end of a calculation that can be used for similar 
check computations. 

Another scheme that has been implemented in some commercial computers 
is the use of a significance alarm to detect excessive prenormalization 
or postnormalization shifts during floating-point addition operations. 

One or more bits per operand can be allocated to store the significance of 
the operand. A single significance bit can only differentiate between 
significant and insignificant operands, whereas several significance bits 
permit several levels of significance to be represented. These bits can 
be maintained automatically by the hardware. This is seen to be another 
form of unnormalized - ^presentation when the number of significance levels 
equals the number of bits in the mantissa. 

The second topic for this section is that of the detection of insta- 
bility in numerical algorithms. The problem is to determine the rate of 
convergence of calculations dynamically so that nonconvergent calculations 
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can be discovered programmatically. Calculations that iterate to a 
solution typically assume an initial transient phase characterized b}* 
large fluctuations. After a number of iterations that depends on the 
nature of the calculation, the transient fluctuations die away and the 
calculation enters a phase in which it converges uniformly to a solution. 

The difficulty in determining the rate of convergence lies in the problem 
of differentiating the transient fluctuations from divergence. When it 
is possible to bound the number of iterations that are subject to tran- 
sient fluctuations of large magnitude, it becomes a simple matter of 
computing the rate convergence after the transients are known to have 
died away. Calculations that do not lend themselves to this method must 
be treated in other ways. A simple expedient is to fix an absolute 
upper bound on the number of iterations that can be performed. 

4. Recovery from Detected Numerical Computation Failures 

In this section it is assumed that the two types of detected failures 
are loss of significance or excessively poor rate of convergence of a 
calculation. 

Failures involving the loss of significance can be eliminated by 
recomputation of numerical quantities in multiple precision. Problems 
for which double precision is insufficient can be eliminated by triple 
or quadruple precision calculations. Pathological cases do exist, however, 
that require calculations of even greater precision, but precision require- 
ments for these calculations can be reduced significantly by preconditioning 
and scaling the data. Recovery from loss of significance should call for 
recalculation using progressively greater precision until either a signif- 
icant result is obtained, or the cost of continuing the recalculation 
exceeds the potential usefulness of obtaining increased significance in 
the result. 

Since the recalculation with increased significance should be invoked 
automatically by the programming system, we shall consider how this might 
be effected. There are at least three different methods for accomplishing 
this . 
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The fi^st method is based on the control of precision through the 
program instruction repertoire. Multiple precisions for calculations 
of varying precision can be prewritten or compiled on the spot from a 
stored high-level language description of the algorithm. In eithei case, 
increased precision is obtained by using multiple precision instructions 
in the machine instruction repertoire or by making explicit calls on 
multiple precision software routines. 

A second method is to make the precision of the calculation data 
dependent. Each datum is tagged with a field that indicates its pre- 
cision. Upon execution of an arithmetic instruction, the lengths of 
the operands and the lengths of the result of the operation is a function 
of the tags of the operands. (The Burroughs’ B-6500 uses this mode of 
operation. ) 

The third method is to make use of microprogrammed, primitive 
sequences of instructions to implement arithmetic commands, and to use 
a microprogram memory that is modifiable. The precision of the arith- 
metic operations can be altered by making changes in the microprogram 
sequences for these instructions . 

Each of these three methods requires further study in context to 
determine how to handle problems of memory allocation of data so that 
change of precision can occur with relative ease. lote that in each 
of these systems, constants must be stored in the greatest precision 
that might take part in a calculation. 

C. Failures Arising from Program Faults 

Programs can be extremely complex entities, sometimes far more 
complex than the computer system on which they are executed. In spite 
of careful preparation and extensive checkout, rarely do complex pro- 
gramming systems contain no faults. There are several raaions for this. 
First, programming a complex system is a difficult task. Every contin- 
gency must be foreseen and explicit instructions must be written to 
handle each contingency. The larger systems require a cooperative 
effort of a large number of individuals, but the demands are such that 
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the output of each individual must mesh perfectly into the programming 
system. When systems are tested, their behavior depends not only on 
the input data, but on the internal state of the system, and qsite 
frequently depends on random timing of events. Exhaustive testing of 
programming systems over all possible internal state configurations, 
representative input data, and typical timing conditions is not feasible. 
For complex systems, even after several continuous months of testing, 
only an infinitesimally small fraction of possible conditions can be 
tested. 

Hence, to create fault- free programs, several techniques that do 
not involve exhaustive testing must be used. In this section we describe 
several techniques for the prevention, detection, and recovery from pro- 
gram faults. 

1. Prevention of Programming Faults 

There is no panacea for preventing programming faults. It is the 
nature of programming that every detail must be specified, either explic- 
itly or implicitly. Any single incorrect detail in the program can lead 
to its failure, and the total volume of detail easily exceeds the volume 
that a normal human can comfortably focus attention upon at one time. 

In this section we explore a body of techniques that aid the human in 
describing the details that constitute a program, checkingout the program, 
and maintaining that program after checkout. 

a. High-Level Languages 

The value of high-level languages is well known. They relieve 
the programmer of a great deal of the burden of specification, and 
thereby eliminate an important source of error. As the understanding of 
application programs progresses, the compilers for application programs 
likewise evolve to include greater power and flexibility. The original 
FORTRAN compiler gave the programmer primarily the facility for alge- 
braic manipulation, control of loops, definition of data arrays, and 
formatted input/output. High-level processes have evolved so that 
recent compilers include processes for memory management, file control 
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and manipulation, sorting and searching of data bases, and symbolic 
manipulation of data structures. Languages will continue to evolve, 
and as they include more complex processes as basic language elements, 
the problems of writing programs that duplicate these processes will 
virtually disappear. 

The future is not as rosy as the previous paragraph may indi- 
cate. The evolution of high-level languages has seen one ?evel of 
programming errors disappear to be replaced by totally new programming 
errors. Today's programmers need never write out the details for an 
addressing polynomial, but they must be capable of determining the 
proper controls for system executive, language compiler, linkage editor, 
and relocating loader. Program checkout has actually increased in dif- 
ficulty because high-level languages let the programmer create more 
complex programs than might be attempted without their aid, thereby 
greatly increasing the number of operations within the program that 
might be faulty. Evolution of programming languages will undoubtedly 
continue in the current direction of increased facility and power of 
expression, but the claim here is that languages have tended to over- 
look the need for improving their reliability . 

FORTRAN illustrates the case in point. The language is so 
defined that programmers need not declare variables. The first reference 
to an otherwise undeclared variable constitutes a declaration by default. 
This relieves the programmer of a small burden in return for increasing 
the unreliability of the language. Every misspelled name of a variable 
constitutes a default declaration of a new variable so that misspellings 
cannot be detected by the compiler. On the other hand, the programmer 
finds that he must declare a large fraction of his variables anyway 
with DIMENSION, COMMON, and EQUIVALENCE statements so that the small 
service of default declaration comes at a great cost of reliability. 

There are several means for increasing the reliability of a 
programming language. The most important is to incorporate redundancy 
into the language so that compilers can perform consistency checks on 
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statements in the language. Declarations constitute one form of redundancy. 
Given declarations, compilers can check such things as the type and pre- 
cision of all variables to determine if they satisfy the conditions imposed 
by context, the number of dimensions on references to subscripted variables, 
the agreement of actual and formal parameters of subprograms both in number 
and type, etc. Formal languages have tended to avoid redundancy instead 
of embracing it. Vet redundancy is a vital part of natural language, and 
is essential for communication between humans. The problem than is to 
investigate how programming languages can be made redundant to increase 
their reliability without placing undue burden on the programmer. 

One answer along this line of thought is to allow the programmer 
to declare contexts for each variable name. The compiler could check each 
executable statement to determine if the statement operands can appear in 
the same local context. Another method is to specify the processes that 
can alter a variable or may read its value and program the compiler to 
check references against this specification. These suggestions concern 
primarily the detection of errors during compilation. Redundancy in the 
language also gives facility for detecting errors at execution time that 
cannot be detected during compilation. One such error, for example, is 
an out-of-bounds index to an array. The remarks here are intended to 
point out the problem and indicate a direction to take for its solution. 

A deeper treatment is beyond the scope of this memo. 

There are other features of a language that affect its reliability. 
Some language constructs are error-prone and can be replaced by equivalent 
constructs that do not lend themselves to error. An example of an error- 
prone feature in FORTRAN is the specification of Hollerith data using the 
r nH T format where the ri preceding the is a decimal digit that specifies 

the number of literal characters in the Hollerith string. It has proved 
to be difficult for programmers to count characters in a string accurately, 
particularly if the. string terminates in blank characters. Later FORTRAN 
compilers have permitted strings to be delimited at both » nds by the *S" 
symbol and have eliminated the need to count the number of cha* acters in 
the string. From this example it is suggested that a study be undertaken 
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vo determine what general features of high-level languages are unneces- 
sarily error- prone. What alternatives exist to replace these language 
elements with equivalent but less error-prone features? An interesting 
subject for this study is the role of default options of the language. 
It is conjectured that ill-considered default options share a large 
responsibility for the unreliability of languages. For example, PL/I 
default options lead to the unusual side effect that execution of the 
statement 


I = 25 + 1/3 

yields a value for I of 5.333333. 

b. Independent Check Calculations 

The reliability attained through the use of high-level languages 
is achieved because the language processors can mechanically generate 
correct sequences of instructions from terse statements in the language, 
and because the processors can detect those blunders and transcription 
errors that lead to inconsistencies in the high- level program. Many 
faults can escape detection by a compiler because there are elements of 
processes that must be specified in part by the programmer, and the 
faulty specification is internally consistent with grammatical and 
semantic rules of the language. To detect this class of faults, it is 
necessary to rely on other approaches. 

One of the most useful methods of checking for faults is to 
incorporate independent check calculations into a program. The nature 
of the check calculations should be to check the functional behavior of 
program modules in a manner that is not dependent on the internal struc- 
ture of the modules. For example, a matrix inversion can be checked by 
matrix multiplication regardless of the particular algorithm chat was 
used to compute the inverse. Check calculations should be done so that 
they do not merely check functional modules against module specifications 
because the specifications may have suffered transcription errors. The 
modules should be checked in their system to see if context meets the 
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performance requirements of the system. For example, consistency checks 
on a guidance computer should not merely determine if the guidance computer 
solves a set of trajectory equations, but rather if the guidance computer 
solves the set of equations that correctly model the behavior of the space 
vehicle that carries the computer. Thus programmed consistency checks 
for this example would compare measurements of the actual vehicle trajec- 
tory to the calculated trajectory in order to determine the correctness 
of the guidance computation. 

Consistency checks can be performed at several levels within a 
programming system. At the lowest level, checks can be performed for 
many primitive machine operations such as arithmetic and logical opera- 
tions. When the operation is inver table, it can be checked exactly. 

When not, it is possible to check if the results of the operation are 
consistent. Check", at this level will not detect programming faults per 
se, but rather detect hardware failures that effect particular primitive 
operations . 

Consistency checks of higher- level processes can be performed 
in much the same way as for lower- level processes. When processes have 
inverses, the consistency check can be the inverse process. For example, 
root extraction from polynomial equations can be checked by evaluating 
the polynomial with the extracted root as an argument. Processes without 
inverses can be checked by determining if the output values are self- 
consistent and consistent with the input values and initial state of the 
process. As an example of a checkable process of this type, consider 
the programmed model of a physical process undergoing a smootnly fluctuat- 
ing change ox state while under the influence of smoothly fluctuating 
inputs. If a sharp discontinuity is discovered in the output of the 
process, then it is likely that there is a fault in the program, or else 
the discontinuity is characteristic of the process, and its detection 
could have been predicted before hand. Thus, monitoring the output of 
the process for discontinuities is a satisfactory check. 
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To ensure ultrareliability of a program, all processes should 
be checked at some level, possibly at several different levels by several 
different checks. Among tlv factors that affect the placement and number 
of consistency checks are: 

(1) The cost of performing the check calculation in 
terms of programming effort, execution time 
relative to the execution time of the process 
that it checks, and total system memory require- 
ments. 

( 2 ) The cost of experiencing an undetected failure 
in the module that is checked. 

(3) The probability that a failure can escape detec- 
tion if a check calculation is performed, 

(4) The resolution of the check calculation in terms 
of its capability of determining the location of 
a failure. 

(o) The probability that the check calculation contains 
a fault. 

Of course, during the initial phases of program checkout, a great many 
checks should be performed in order to obtain high resolution of the 
location of faults causing detected failures. This entire discussion 
is directed to the final phases of checkout and full operational status 
when it is desirable to execute without the overhead of elaborate check 
calculations, yet maintain a certain level of reliability. 

The problem of determining how many check calculations to per- 
form and where to place them is the software equivalent of the classic 
problem of designing redundant hardware. Software redundancy, like 
hardware redundancy, has a definite place in an ultrareliable computer 
system. It should be designed into system from the initial inception 
of the system. The form and degree of redundancy that should be put 
into software is a function of factors particular to the situation. As 
this subject is studied further, perhaps there will emerge some general 
methods for building redundant software as has happened for hardware. 


c. Software Maintenance and Modification 

The inherent flexibility of software with respect co the fixed 
structure of hardware usually results in system modifications being 
thrust upon programming systems rather than upon the hardware that sup- 
ports the programming system. The cost of these modifications is quite 
high in terms of reliability because changes can cause new errors to 
appear in systems that are otherwise error free, and the new errors may 
escape detection during system checkout. There are two problems to be 
faced here. How the programs are to be written to minimize the possibility 
of introducing new errors when they are modified, and how newly modified 
systems are to be tested to determine if new errors have appeared. In 
this section we consider several techniques for attacking these problems. 

An important factor in the solution to the problem of software 
modification is in the design of the software system. Software should be 
designed to accommodate changes easily, as if changes are unavoidable. 

In most cases, changes are truly unavoidable L cause it is difficult to 
anticipate all of the required characteristics of a software system 
until it has been constructed. 

A good method for accommodating changes easily is to design 
modular software systems. Subprograms should be organized functionally 
so that they operate independently, communicating only through a minimum 
set of parameters. System constants should be treated as parameters so 
that the modification of a parameter in a single point in the program 
causes all references to that parameter to be altered. A process that 
is common to several subprograms should be factored into a module that 
is called by the several subprograms. For languages like ALGOL and 
FORTRAN, factoring is essentially equivalent to the construction of a 
subroutine, block or overlay. 

Factoring of processes may not require that the processes be 
physically partitioned, as is the case when processes are factored into 
subroutines. The modularity required for ease of program maintenance 
is semantic modularity, and this is a by-product of physical modularity. 
Physical partitioning produces a certain amount of program inefficiency 
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because of the overhead for establishing linkages during program execu- 
tion. Semantic modularity can be attained without physical partitioning 
and its attendant inefficiency, by using the programming device known as 
macro-expansion . Macros have been available for many years in low-level 
assembly languages and are only recently being used in high-level languages. 
To date,, the implementation of macros in high-level languages has been 
somewhat limited, but it appears that macros are a natural adjunct to 
common algorithmic languages . Undoubtedly, further effort will be given 
to increasing the power and utility of macros in high-level languages, 
so that it will become increasingly easier to modularize programs with- 
out loss of efficiency. 

It is possible to carry the partitioning process too far so 
that the net result is a loss of reliability rather than an increase. 

This occurs when similar but nonidentical processes are grouped in a 
common module. Usually this is done by including several tests to 
differentiate among the various processes at appropriate points in the 
module. The problem with this type of par^i tioiung is that a modifi- 
cation to one process can affect all other processes that share the same 
module . 

The problem then remains one of determining how and when to 
modularize. Should the programmer use a macro or subroutine? Should 
two processes be placed in separate modules if they differ in only one 
respect? If not, then how different should they be in order to be 
separated? To answer these questions, the programmer must apply his 
experience and judgement and consider the factors of tha particular 
situation. That fact that only subjective characteristics are considered 
here is indicative that program partitioning is still very much an art. 

Its importance has been recognized to the extent that aids for parti- 
tioning are common in high-level languages. These include block struc- 
tures, multiple job-step programs, program segments , overlays, and 
asynchronous tasking. There is still much to be done to aid modulariza- 
tion so that standard software packages can be 'written, completely 
checked, and debugged, then used in many systems. This requires that 
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standards be established to guarantee standard program interfaces, file 
structures, and languages. r ~ 

c A second important factor affecting the maintenance of software 

is the documentation of the softwaresystem. A problem in developing 
adequate documentation currently exists/becauseit has not been firmly 
established what cons ti tufes adequate docusehtation^ Several methods : _ 

h^^heen commonly used, none 

lor/o^ discussion tere^we t> pes ^f 4 ocumen tatio^ 

primary add se^hdary. The sdufed^^ 

tte; It- is the document that , 

into an executable program, secondary documentation is any: material. : 
7t^t ^eshribes^ toe~pro^am- bht cannot bemechani cally ^ahSiat^dL iihb 
an executable program. _ v: -- - “ 

7-- It is generally conceded that primary dociaien^tt^jc^ ^ 
know it today, is inadequate fbr ma int a ining and updating cqmplexsof t- 
ware ^systems . The re? «>n feeing thatr-the reqmj^tents^or de^ - 

process so chat ~a computer can" perform the ^cess a^ quite iifi^reiit ; 
from those -.'Tor l td e M - he can •: 

gain an understanding of it. Therole of program documentation is to -'r-'-j-j 
staler both the intent and die * lethodology of a program , and to describe 
-the structure and meaning of the program data. The primary document 
describes the methodology but not necessarily the intent of the; program, 
and; gives the structure of the data but not necessarily its meaning. 

Thus, the primary document may not be sufficient documentation, even 
though it is a complex description of the executable program. 

Through the use of symbolic programming languages , it is 
possible to enhance the descriptive qualities of the source document to 
make clear the intent and meaning of a program. The programmer can use 
descriptive names and phrases for program elements and data structures 
so that the source document takes on the qualities of a narrative descrip- 
tion. Consider, for example, the mnemonic content of the name INPUT, as 
the name of a rubrputine and how much more descriptive that name is- 
compared to a name like QSIA, which might be equally acceptable as a 



symbolic name. Partitioning can also be used to' enhance the descriptive 
nature of the primary document* Program details often tend to decrease 
the comprehensibility of the program. When a set of steps are factored 
into a program module, and that module is given a descriptive name, the 
effect is to~ replace a- lengthy portion of relatively uni formative text r ; 
^riLth a short terse description of the intent ^ of . text* .Past experience 

jffls^shown th a t ; i i^spite of the de s cripti vequalities #-^yBSo3ic har . : 
guages * there^ must still be additionalrdocia^ntation to V aidr^pi^rMrH- _ 
maintenance.Therefore, have J^h used heavily tb docu^ 

-^ment systems. ~ : 


9^^ shdrfec^h^ dbcua^ 
odl ty ih following the; flowof control in a ^ogram. A humair cannot easily 
/fcracd prc^rambranches > is; Si f f i cult to re late the source -of a 

jump to its target when; the two 7 are physically r emote in a listing * Conse- 

gain^; a ^ 

to.: ma^e. these -lihlpges - ^isiible. ^Stt f l^ch^tfcdp ^ 

{-satisfy the: need for docimentatioi- either; Jfhey eliarifyct^ _ 

of the methodology, ^but they give neither the intent o£^ the program,.Cnor' 
dd : they give meaning to the data. ®ey fail td^giv^ah ad^uate* presenta- 
tion of intraprogram communication, when the coBmunication is done throxigh 
common data rather than through the flow of control. Programs that make 
use of recursive processes or asynchronous tasks do not lend 'themselves 
^well to descriptions by flowcharts. Finally, because- flowcharting is 
secondary aid, the flowcharts must be maintained and updated as the pro- 
grams that they describe are updated. When the flowcharts are done by ~ 
hand, the updating is costly in terms of utilization of human resources. 
Automatic flowcharting has become available recently, thereby eliminating 
some of the problems of flowchart maintenance, but it still does not 
remedy the other problems mentioned above. 

To a large extent, the high-level languages have overcome the 
difficulty in displaying the flow of control in a program. By avoiding 
the "GO TO" construct and making use of phrases such as "IF. . .THEN. . . ELSE" 
or one of the ^common iterative control statements, the programmer can 


increase the readability of his source document. Partitioning can also 
be used to keep the flow of control in a program in a narrow context. 

Thus, when properly used, high-level languages come close to flowcharts 
with respect to their capability for displaying the flow of control. 

-Machine produced cross-references are commonly used to document 
the interaction of data and program modules. Cross references often 
contain information as to the context of each reference to a variable 
~ name. They ri&ghT ihdi ctfte- whether the variable is read or wri tten, or 
passed as ah argumentto a subpirc^a^ iis a ' _ : _ 

content dr declaration. This information is extremely useful i^eh 
modif i cations are made because f he effect oft^nging a 
can -hfe traced ?t8 all per tinent contexts lit the ; £rqgr^ 
t^s t3rpe is noit present ^ 

to important type of documentation known as the decision tables 
has grown popular as a replacement? or the flowciiart. J>ecisio^^ 
are two-dimensional tables -ih whieh a list of processes is arranged 7 c 

along the row ox is and a list of conditions arranged along the column 
axis. / An entry at a row- column inters in the^t^ ^that 

the roiv process is to be executed when the column conditions are satisfied. 
Usually the columns represent a set of mutually exclusive conditions, and 
multiple entries in a single column are assumed to be executed sequentially 
from top :to bottom of the table. The value of decision tables is that 
they show the flow of control of liigh-order processes, and, hence y lend 
insight into the intent of the program. Mechanically generated flowcharts 
tend to show more detail than necessary, and thereby tend to hide the 
intent of the -program. 

A novel aspect of decision tables is that it is possible to use 
them as a primary source rather than as a secondary source of documentation. 
That is, the tables constitute a high-level programming language that can 
be translated into executable form. Thus, decision tables are a form of 
self-documenting program. \ 
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An interesting prcblem that might be treated ir the future is 
_ the problem of designing programming languages and techniques that extend 
the notion of self-documenting program. Certainly, one step is to pro- 
vide higher-level operations in the source language. The COBOL SORT 
. ... verfr exemplifies this methodology. Another technique is -to permit and 

encourage the programmer to use descriptive names. FORTRAN is overly . v _ , V7 7>:-7 
restrictive in this sense, because identifiers must be six characters or % 

. . _: less. The length- restrietion could be doubled or tripled without over ; - 

• 'burdexiing tiie compiler and provide the pn^huiner with a much greater 

- de^<%ptiyei _capabili.ty.: Other techniques that might be. considered are" . “ ' ^ : 

totural language processing, and the use of graphic languages with dis- . ^ 

play input-output. 4 .1.;' : * r ' 7 

■f- .. -‘ T’iie second major topic to be considered in this Section, is the' -V. :• 7" 

problem of testing a program during program checkout. HOw does one go > 

about determining the correctness of a program? Tte mois t ctnsmon methodr - ' ’ ' . 

ology is to check each program module; independently , then -to check r; . ; . % ^ 

v increasingly larger. collections Oi mod^es wutii 1 the "full syat^ is : . 

tested. -Although this is a- good methodology, it is insufficient by ? . 

itself because many errors can escape, detection unless a more methodical: 7^ : ~ T ~ 

■ ' approach is taken. ' ' v 7^ : .7^-77^ ” 7' ' :}■■■ " : r j- 77’ ;i v ' "7 r 


Another technique is to generate test data': such that it is 
guaranteed that every branch of the program is executed at least once. 
Ah interesting problem associated with this technique is the automatic 
- generation of test data by the source language compiler to test every 
branch completely. Another technique that might prove valuable during 
checkout, when the true behavior of the program is not known, is to 
mark program instructions as they are executed in order to identify 
those instructions that have not been exercised. As checkout proceeds, 
the list of unexercised instructions can be used to guide the latter 
phases of checkout, and to ensure a checkout completion that every step 
has been tested for at .least one set of data. - 

Even the type of test described in the previous paragraph is 
insufficient, because the behavior of a program is a function of both 
the input and the state of the program. Hence, the program may accept 
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a set of data and operate correctly during oue iteration and fail during 
the second iteration on the very same set of data. Assuming that an 
exhaustive test of the program is not feasible, a good approach is to 
aesign a thorough test that takes into consideration the state of the 
program. To do this, the input data is partitioned Into several possible 
classes and similarly the possible program states •» re divided into sev- 
eral different classes. Given both the classes of input and the program 


states, one then designs 


t of 7 test data that will test the behavior 


of the program for every pa^ of input data-program state classes. The c 
thoroughnessof the check depends on -the fineness' of the state and input ^ 
class partition. C j •' r - : ’ 

' - 7 Wh<en modifications are made to programing systems , the common 

practice of checkiiig only tlte j&koges that .were made to the system often ' 
leads to undetected errors' in the program . The- problem here is that if an 
error is introduced -into the system, it can occur because of subtle inter- 
actions that are connected^ ^ to the section of . the ; program 

that were changed. Furthermore, ^if .the interactions had been foreseen, 
while the changes vrere being made* they might have, been avoided^ ' When — 
the system is checked only for ttose cases which are known to be affected, 
the subtle problems, may hot appear. It is important to recognize, there- 
fore that a change to a program produces a completely new program that 
must be treated as "if it were completely untested. r 

2. Summary - - \ 


In this section we have considered a large number of problems related 
"to writing and checking out programs. The single most promising approach 
for achieving ultra- reliable programs appears to be in the use of high- 
level languages. Through high-level languages, it may be possible 
to mechanically generate the check calculations, test data, and program 
documentation that have been described elsewhere in this chapter. In 
effect, the computer is the ideal tool for aiding the programmer in 
developing a reliable program. 


1 
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Techniques for Detecting Software Failures 


In this section we assume that failures might occur in programming 
systems in spite of careful program preparation and extensive system 
checkout. Program Caul ts c . might escape detection during checkout, but 
occur during normal system operation if for some reason the conditions 
under which the failure can occur are satisfied* Those transient hard- 
ware failures that have effects that are similar to software failures 


can also be detected by ul trareliable c, pr ogrammingtechniques . The * 
of c the techniques that^ wfcdest^be^^ 
tec the syfl^em frbm softwai^or transient hardwire failures by de tweet- 
ing failures at th^ earliest possible time and returning information - 
^iat enables the system to recover from the failure.. Additional infor- 
mation that is returned mightbe used to diagnose the possible cause 
bf the failure. >We stress software techniques in this sec tion, bht- 
"^hatdware techniques are considered where ^pertinentir } ^ 


1 » Protection ^Against Incorrect Memory -Accesses 


; / One of the least reliable" aspects of computer systems is the : • 
Volatility of the memory. When a failure^ causes inf ormation- in memory 
to be destroyed,,- it may not be possible to recover the lost information; 
thus, the failure is an irreversible action. -It is unfortunate that a 
characteristic of program faults is that they cause incorrect memory 
operations to occur, with the attendant loss of information. A faulty- 
memory operation may be one of two kinds. Either good data is replaced 
by incorrect -data , or more seriously, the address of a memory operation 
is incorrectly computed, thereby causing tne storage of good data in 
the wrong place in memory. _ 

A fault that causes information to be lost may or may not be a 
serious one in terms of the survivability of the programming system. 
Clearly some information is essential for the survivability of the 
system. Programs must not be altered by faults, for example, nor can 
data that holds linkages and status information. Therefore, we shall 
assume that a class of critical information can be identified, ard 
that this class can be given special protection*. 


99 





The most serious type of failure is the write operation, because 
it is inherently destructive. However, for very . little extra cost, it 
is possible to protec* both 7 read and write operations and gain additional 
reliability. - r * - 

When critical information is constant, it can be placed in a read- 
only memory to assume its nonvolatility. It might be * wired-in, " or _ 
stored in a memory that has both a nondestructive and destructive read- 
out capabi lit y. data is stored, the assumption is-that the 

memory is_ switched' to nondestructive mode . "Privileged subprograms ^ 
might be giveh the * usability to alter the contents of "the protected 
memory j but the mOr ry is ^always placed in the nondestructive- mode after.: 
modification. - We. note here- that it is possible to use convey ^ ^ 

memories- together with special hardware in. the access .circuitry to imple- 
ment protection against unauthorized attempts to modify critical datai 77 
In a sense, the . conventional memories can bel jaade to, simulate memories 
with to /e .and ifondes tructi ye read-w^ protection 

rcapability. This type" of system /has been implemented in several 
commercial computers. .. . : ; ~ 

Hie protected memory is most effective in protecting against faults 
in nonprivileged programs. Privileged programs can be faulty also, -so 
that there must be some protection against; faults that successfully' 
penetrate the memory protection. The most important aspect of assuring 
reliability in thirs case, is to be certain that critical data can be 
restored if it is lost. One method for doing this is to dump the 
contents of critical memory periodically, arid restore the memory from 
the last checkpoint when a fault is detected. Another method is to 
retain the last few values for each item in critical memory so that 
values can be restored selectively. Care must be used in the imple- 
mentation of the latter technique because data cannot, in general, be 
altered out of context. It is necessary to change sets of data so that 
all values have meaning as a collection in the context of the programming 
system. 
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The technique of using a real or simulated read-only memory is 


valuable for protection of noncritical data, and for protection of the 
programs that manipulate these data. Suppose that all of memory can be 
partitioned into blocks that are given either read-write or read-only 
- status. We shall -assume that the status can be altered dynamically by 
some programming mechanism. Then invalid addresses that are generated _ ^ \ 

may be detectable When they lead to attempts to write destructively in < 

read-only block* Programs can always be placed in read-only status 
/ when they are written as reentrant programs, i*e* , they do riot modify J 

... V themselves* .Data structures might be^ capable “"of -being in either status < _ ? ^ , 

~ 7e;V A valuable feature in a reliable., programming system i^s the ability to ---> 

: - alter the status of data structures* The tecbhiftudSc ; - 

-H-:- - can be* extended further by considering other states -that might chai*ac^~ ■ -- 

;i terize blocks of ~ memory . For example , execute^only is a state that " ^ 

... could protect machine instructions from being read erroneously as ^data, 
or could also prevent data from -being erroneously treated as -ins tructions. 

*' The ability to define protected states places. some demands on both _ 

;;2:^ ; ^ v “;-.the hardware and the software. The s taSe" information must be held at T v v c ‘ jf ; 

v some point in memory from where it can be retrieved quickly during each 
memory cycle. It might be held in a register or in a special memory 
plane, it might be stored in a tag field that is associated with every 
word ih memory so that it is available for checking during any read 
operation or write following a des true ti\% read operation. In any case, 
it is clear that the status cf a memory block must be matched against c 

the type of memory request to determine if an illegal operation has 
occurred. 

Several hardware features would be useful in supporting the memory 
protection techniques mentioned above. Reentrant programs benefit from 
instruction sets that include special instructions for facilitating 
reentrant coding. For example, the common practice of depositing sub- 
routine addresses at fixed linkage points within the body of a program 
violates the rule that reentrant programs must be read-only. To aid 
reentry, the return link should be stored in ^ register or in a data 
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area that is accessed indirectly through a linkage address. To aid the 
protection of th** critical data area, ultrareliable hardware designs 
should be utilized in the block of memory that holds critical data. 

This lowers the probability that critical data will be lost because of 
hardware failures. Memory access circuitry should be designed redun- 
dantly to minimize the probability that valid memory requests will be 
honored at invalid addresses. Paging hardware end relocation registers 
are useful items for storing -status ihf orma tion p.Qiej^ b lo cks. St a tus 
checking is a natural extension of the processing that: occurs with paging 
and relocation hardware during the computation of effecti ve addresses . 

'The main point of this section is to bring to light techniques for 
protecting against invalid memory ^requests* ” In tie next sectioh, we ‘ 
cohsicieir techhidhesifor J broa^ sopped of "-the ^p^tectiom - 3 : 

2 . If and Only If programming K > 

The previous materiai roh the protection cf memory operations is an 
important-example of the notion of if and only i f ' programing . . Memory 
protection is possible because rC the program and data essentially defines 
the region of- memory that STs addressable at any given time . But It ? 
also defines the region of memory that is not addressable. Furthermore, 
different portions of a program may have different addressable and 
unaddressable regions. The protective measures generally compute for 
each effective address whether it is in a region that is accessible or 
unaccessible by the process requesting a memory operation. 

Thus a request for a memory access is honored: 

(1) if a process issues the request, and 

(2) only if the request is valid. J 5 

Carrying this notion further, we postulate that programs should 
be analyzed to determine not only what combination of input data and 
internal states are valid, but to determine those combinations that 
are invalid. When an invalid combination occurs, a failure has 
occurred, which should trigger a detection mechanism. 
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As another illustration of the notion of if and only if programming, 
suppose that a program is tested thoroughly, and it is determined that 
^the program is completely correct. It may be the case, however, that 
some variable ranged only through the values between 0 and 10 during 
the checkout phase, arid it is known„ beforehand' that no other values 
-'*\n be encountered during normalroperation. What happens to the program 
if the variable in question takes on a negative value? By assumption, 
this cannot occur. inrnormal circumstances*, but suppose _the negative ^ \ 

-c-value were introduced by an undetected failure* The behavior of the . r : 

'-■:'"*'"-V^ 0 S r - a ®t--- T is' not defined, and its true behavior ihTthi § ci rcums tance can - - - 

be anyth! ng from harmless £6 disastrous * _ - r ^ 5 ,'^f ? J 

A ^ : The Use of if and only if pi graimi^^ih this ex^ple^^Would oegin : l r ; ■ 

A with the de termination that 7 the true range -of validation of the input- 

variable is the range 0 to 10, and would fol yrttli the insertion of A g ‘ ' 

a test of the variable uo determine if it lies in the range of validation. 

■ . - One class of items that are easily checked are the c lass of' variables > 

that ore-used as indices of elements of a data structure. When a variable 
is an index inton particular data structure, then the Valid range of 
values for that variable is completely determined by the data structure. 

The variable should be checked before every instance in which it is 
used to address an element of the data structure. This type of check 
in one sense is comparable to the memory checks described in the previous v " 
section, but the following difference is observed. The memory checks 
described previously are made after a memory request is issued, and are 
made on the basis- of status of the memory at the effective address and 
the type of memory request. The check of the index variable is made 
before a request is issued, and is made on the basis of a variable that 
is used to calculate an effective address, rather than on the nature of 
the data found at the effective address. 

Indices to data structures can be checked by range checks to deter- 
mine if the index is within Ihe actual bounds of the structure. In 
commercial systems, bound checks of this type are common, and have been 
implemented by both hardw re and software techniques. The techniques 
generally postulate that the bounds of a structure are associated with 
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the base address of the structure, so that whenever the base address is 
used to calculate the true address of ah element of the structure, the 
bounds are also available to be used in the calculation of ^nge checks. 


The technique mentioned above is adequate for some types of data, 
but is not sufficient to protect physically large data structures when 
there is: a high probability that any incorrect address lies within 
bounds. One class of data that satisfies this criteria is the linked- 




list data Structure* Most of the data within linked-lists are addresses ;- : 
(or indices) of other elements in the ll-st-strUcture, Within the :lis ^ ■ 

structure, the linkages may identify . several--, di ft erent classy ’ bf\^dMaj^f" r : 
but the linkages usually hive one form that is cofiimon to the entire 
structure* Faults that generate incbrrecf liidcag§s might be Undetectable 
by bounds calculations. A better check is to store with each pointer 
the class of the item to which it must point. With each item a field is 
used to indicate its class. Whenever a pointer is used tc access an' 


item* the class of the fetched item is compared to, ihe class thiit was„vv 
expected, and an error is signalled if they do not agree,- A problem 


arises when a pointer, might point to one of several classes*: The Check 
operation then becomes rather lengthy and contributes system inefficiency, 
A variation of this technique eliminates the problem of overhead. Instead 
of associating classes with data, each item is associated with a randomly., 
selected tag. When pointers to the item are created, a copy of the tag Tr - \ 
is placed in each pointer. Whenever an indirect access is made through 
a pointer, the tag of the pointer must agree with the tag found with the 
item addressed. 



Indices and address linkages are just one class of variables that 
are easily checked by the if and only if technique. In general, it is 
possible to perform a great variety of Checks on data, and possibly on 
a large majority of the variables in a program. When a variable is 
associated with a physical process, then it is highly improbable that 
the variable would take on values throughout the range of numbers within 
the range of representation. Hence, the variable can be tested periodically 
to see if it falls outside a realistic range of values. 
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Endless loops can be prevented in iterative programs by using 
roughly the same technique. Every iterative segment is given a unique 
loop counter to keep track of the number of loop i terations after _ 
initial entry. The loop counter then is a range of V 

values is bounded by ^ the largest; number 

to be executed to complete processing; c Thus 1 , a: 7 rhtrbduced 

for every loop counter (it need not be a iMst upper 
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problem of er ro r detection to a close , The first pro blem is tOi.de terroine 
what to do after an error is detected in-order to maintain the viabili ty 
of the programming system.. The; second problem is to return as much 
information as possible that will aid in the diagnosis of the detected 
error. Since these problems are related, we treat them both together. 

It is as; umed that there will be a hierarchy of programs in the 
programming system with at leastHwo levels in the hierarchy. The ? , " 
highest- level program is the system executive, while the application 
programs occupy the lower levels. It is natural that detected errors 
cause control to be reverted either directly to the system executive 
or to the program that is one level higher than the level at which the 
error occurred. Both types of error exits may exist within one system, 
and the type of detected error could determine the point at which control 
is to resume. It is clear that every program that might be the target 
of an error exit should have a subsegment defined that will be the point 
of resumption of control. Furthermore, the address of the error entry 
should be posted in a linkage area that is immediately available to the 
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hardware in case an error is detected. In this way , a recovery procedure 
will be available for every possible detected error. 


- - 


A problem exists when a detected error occurs for programs at the 
highest level. Where should control be transferred in this case? 

Clearly , if the wrong transfer is taken, the error say be repeated, and 
thereby begin an infinite loop. There in no. good answer for- this problem. 
It way be nece ssary to attempt recovery in a standard fashion and sianl- . 
tiuieouslynotify a huoan operator or ait external computer of the nature 
of the problem. A technique that will tend to reduce the magnitude of 
the problem is tomake the highest- levelprograi as smallas possible 






to reduce the probability of maltl ngand detectii^ errors at that level. 

The canon practice of using two-level hierarchies places a large burden ' 
on the executive programs, and thereby increases the pr ^ability of faults 
occurring wh£je operatingin ennttive mitei The use . 1 a third level 
for the highest level would be far more satisfactory, because most 
executive processes could- be second level processes and receive -a greater. ■ 
degree of protection than can be obtaine'f from twb level systems. \- c 

Assuming that recovery points for every 'type oi detectable error can _ 
be defined properly, except possibly for those errors that are detected * p : 
in the top level programs, the problem remains one of diagnosing the 
reason for the detected error. It 1 is essential that the context of the 
processor at the time o" the error be? saved and t made available for - ' 
diagnostic purposes. The context is the contents of all visible registers, 
status bits, the instructior counter, the effective address of the last 
memory access, and other pertinent data. If the data is to be returned 
to a human, then it should be analyzed and translated into terms that will 
ease the problem of interpreting the data. Symbol! c names can be supplied 
from symbol tables that might be stored in auxiliary memory for diagnostic 
purposes. Linkage traces can be used to de ermine the return linkages of 
subroutine calls. Arguments of subroutines and the val.es of important 
variables can be posted to give further diagnostic aid. Since program 
jumps usually prevent program steps from being traced backwards, it is 
recommended that jumps leave a return address in a special register or 
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set of registers organized as a small stack. At the point -of error 
detection, the register or register stack will contain the source points 
of the last few branches. - 

It fs recommended that no -natter how severe the detected error, the 
system should maintain Integrity for a period that is sufficiently long 
to gather and output diagnostics. The reason beingthat the fault that v;- 
caused the failure nay lead to failures under fare conditions that are 
(ttfft talt to analyze or repeat. When this type of f^ lure returns' 
diagnostics , i nsuff icient for correction of the fault , then it is highly 
v probable that it will remain in the system indefinitely. 

. -In tte course of this discussion, several aspectsof thereliability. 
of pf^nteing systems have been explored, tfebavetreated the subject - 
a, rather general viewpoint, primarily attempting to determine what - 
~ ^techniques .are generally applicable .awl ^problems remain unsolved. -Xf ; ^ 
a parallel with the design of reliable hardware, then : - 
software lacks a theory of reliable construction tbatls enjoyed by 
hardware. In terms of practical designs, both hardware and software 

share a black art foi e design of reliable systems, where the experience 

and judgement of the designer ultimately determines the reliability of 
the system, *-■?' i"'. ■; '* : z x : "’ r ~ ■ *•'. -v; 

If a theory of reliable software is to emerge , it ismost likely 
to be predicated on the use of high-level languages to describe manipu- 
lations in terms of thoroughly checked standard packages. - Red und a n cy 
of expression will be used to automate the insertion of check calculations 
and recovery routines. Certain types of errors will be avoided, bemuse 
there will be no way of expressing ther i<* tne high-level languages. 
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Several interesting problems have emerged from the preceding dis 
cuss ion. These are collected here in the hopes of stimulating lur the 
interest in the area. The problems are: 

1. What methods for automatically changing the precision 
of a calculation? 

2. What features of languages are error-prone? How might 
they be eliminated in favor of equally powerful but 
less error-prone features? 

3. How can redundancy be utilized in programming languages 
to protect programmers from errors? 

4. Is it possible to develop an abstract model of a program 
and use the model to determine the placement of check 
calculations? Can the model be used to guide the 
generations and placement of check calculations for 
practical programs? 

5. What is adequate program documentation? How can 
languages alleviate the problem of documentation by 
becoming more self-documentary? 

6. Can updating be computer assisted to eliminate the 
propensity for introducing new errors by updating? 

7. How can programs be designed to be modular? What 
standards are necessary to aid modularity? 

8. How can tests be generated or built into the hardware 
to give extensive protection for a small decrease of 
efficiency? 

9. How can recovery be made from errors detected in he 

'hard-core'' software? What diagnostics should be 
returned after any detected lailure and how should they 
be returned? Can special routines perform diagnosis? 
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VI CONCLUSIONS AND SUMEIARY OF OTHER STUDIES IN PROGRESS 


In this chapter we briefly presen our conclusions on techniques for 
the realization of ultrareliable spaceborne computers. These are bated 
upon both the research conducted during the second phase and prior related 
work, and also recommendations for the future direction of research in 
this area. We also summarize other studies that have either not yet 
progressed to a state where reporting is appropriate or do not bear li- 
rcctly on the technical contents of this report. 

A. Conclusions 

( 1 ) The computer reliability requirements of an rdvanced space- 
borne mission cannot be satisfied without the use of redun- 
dant logic structures. 

(2) Practically am reliability constraint can be satisfied by 
the exclusive use of passive masking techniques. But, 
with the exception of mission tasks of relatively minimal 
complexity, the cost of such a computer would be excessive*. 
Significantly improved utilization of resources is 
theoretically achieved with a reconfiguration technique 

in which the logical interconnections can be altered so that 
faulty units arc disconnected from the system and moreover, 
with a graceful degradation technique in which the schedul- 
ing of tasks can be altered to match the available perfor- 
mance capability. 

3) For advanced spaceborne missions there exists, in addition 
to the severe reliability constraint, the requirement to 
accommodate to simultaneous introductic . of several problems 
and to varying measures among the mission problems of 
priority, accuracy, and urgency. The multiprocessor frame- 
work appears to be the best match to these i equirements . 

*1) The reliability of the multiprocessor is significantly 
enhanced if a limited degree of fault-masking and recon- 
figuration v or repair) capability is incorporated within 
the memory, control, and processor modules. 
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5) The major problems associated with this repairable multi- 
processor arc the design of memory, control and processor 
modules, which are either amenable to repair or fault mask- 
ing; the design of commutation networks for the data 
switching; the specification of diagnostic techniques for 
the detection and location of faults; and the overall 
reliability analyi- oi the system. 

G) Processor modules, which include microprogram control and 
which are amenable to repair, can be designed by organiz- 
ing the logic associated with each byte or computation 
into a slice or module of moderate complexity i.e., 
approximately 1000 gates/module). The repair operation is 
then the electrical shorting of data around a faulty slice. 

(7) Commutation networks can be designed which are suitable for 
the routing of data between memory and processor modules 
and also for the above defined repair operation. These net- 
works can be easily set up and diagnosed, and can be made 
insensitive to failures within the commutation network with 
a moderate increase in complexity. 

8) It appears feasible to synthesize programs that can detect, 
utilizing combinations of software and hardware redundancy, 
the occurrence of many hardware faults. These techniques 
are also appropriate to the specification of formal rules 
for the synthesis of programs that do not contain human 
mistakes . 

Summary of Other Wor k in Progress 

1 Work is continuing on the description of logical design 
techniques for the various module types of the multi- 
processor. During the next period, a substantial effort 
will be devoted toward the design of the irregularly- 
structured control module, and in particular, to the 
investigation of the optiMum balancing of the various 
rel iabi 1 i ty-enhancemen t schemes . 

2) Work will be initiated on diagnostic techniques for 
locating the block within a module that is suspected 
of being faulty. We expect to consider the diagnosis 
of byte- si iced processor modules, commutation networks, 
associative memories of the type used for memory re- 
addressing and the table of available equipment, and 
general control logic. We have looked at the possibility 
of including auxiliary outputs to facilitate diagnosis, 
and formulated the problem of specifying the optimum set 
of i$uch outputs as a covering problem of the type similar 
to that described in Sec. 1II-B of Ref. 1. 
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(3) Work has begun on the analysis of the multiprocessor 

system, and we are seeking models which are amenable to 
analysis . 

(*1) The survey of the literature pertinent to the problem 

of improving reliability by the use of redundant struc- 
tures has been completed, and a papei summarizing this 
work wilt be submitted lor publication to one of the 
IEEE Transactions. 
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Appendix 


USE OF CODES FOR CORRECTION AND DETECTION 


It is of interest to determine the probability of an undetected 
error due to a coherent noise signal that affects all channels. 


The number cf detectable errors for n double-error correcting c o o 
may be computed as folio, 


n - the number of bits in the coce ^ord 


k c rhe nunU*.- of non- redundant r bits 


then the combined n»imoer of vr lid 


d corrected patterns is 


out of 2 total pattern*. The Traction of error pa' U rns that are on] v 


detected is t) 


or approximately 


This fraction in approximately 1/2 for the attractive n = 16, k = 8 cede 

and increases rapidly for larcer word leneths, If B bytes are used, and 

if all channel errors are independent, the probability of an undetected 
B 

error is f ; for example, for the 16,8 code, with B = 3 ;i.e., 24 non- 
redundant bits per word), f B is 1/8. In the case in which noise tran- 
sient lasts tor several memory cycles (very likely for noise due to 


\ B 

electrical arcing), this probability decreases by a factor of f for 

each cycle. 

We conclude that the normal error detection capability of error 
correcting codes provides reliable warning of the existence of massive 
errors . 
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